1. 28 May 2019 (1 commit)
    • KVM: s390: Do not report unusable IDs via KVM_CAP_MAX_VCPU_ID · a86cb413
      Authored by Thomas Huth
      KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the number of usable CPUs is determined
      at runtime: it depends on the features of the machine the code is running
      on. Since we are using the vcpu_id as an index into the SCA structures
      that are defined by the hardware (see e.g. the sca_add_vcpu() function),
      it is not only the number of CPUs that is limited by the hardware, but
      also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined at runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture specific code, and on s390x we have to return
      the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      a86cb413
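      A minimal sketch of the resulting shape, for illustration only (the
      example_* names and the helper are hypothetical, not the upstream diff):
      the capability is answered per architecture, and on s390 it returns the
      same runtime-derived limit as KVM_CAP_MAX_VCPUS.

        #include <linux/kvm_host.h>

        int example_s390_max_vcpus(struct kvm *kvm);    /* hypothetical helper */

        /* Illustrative sketch: per-arch handling of KVM_CAP_MAX_VCPU_ID. */
        int example_check_extension(struct kvm *kvm, long ext)
        {
                int r = 0;

                switch (ext) {
                case KVM_CAP_MAX_VCPUS:
                case KVM_CAP_MAX_VCPU_ID:
                        /*
                         * vcpu_id indexes the hardware-defined SCA, so the
                         * usable ID range equals the usable CPU count and is
                         * only known at runtime.
                         */
                        r = example_s390_max_vcpus(kvm);
                        break;
                default:
                        break;
                }
                return r;
        }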
  2. 25 May 2019 (2 commits)
  3. 10 May 2019 (1 commit)
    • cpufreq: Call transition notifier only once for each policy · df24014a
      Authored by Viresh Kumar
      Currently, the notifiers are called once for each CPU in the policy->cpus
      cpumask. It would be more efficient if the notifier were called only
      once, with all the relevant information provided to it. Out of the 23
      drivers that register for the transition notifiers today, only 4 of them
      do per-cpu updates; the callback for the rest can be called only once
      for the policy without any impact.
      
      This also avoids multiple function calls to the notifier callbacks and
      repeated passes through the notifier core's code (which takes locks as
      well).
      
      This patch adds a pointer to the cpufreq policy to struct cpufreq_freqs,
      so the notifier callback has all the information available to it with a
      single call. The drivers that perform per-cpu updates are updated to use
      the cpufreq policy. The freqs->cpu field is now redundant and is removed.
      
      Acked-by: David S. Miller <davem@davemloft.net> (sparc)
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      df24014a
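      A minimal sketch of a transition notifier after this change, assuming
      only what the commit states (freqs->policy is the new field; the callback
      body is made up): the notifier fires once per policy and walks
      policy->cpus itself when it needs per-CPU work.

        #include <linux/cpufreq.h>

        /* Illustrative notifier: invoked once per policy, not once per CPU. */
        static int example_transition_notifier(struct notifier_block *nb,
                                               unsigned long val, void *data)
        {
                struct cpufreq_freqs *freqs = data;
                int cpu;

                if (val != CPUFREQ_POSTCHANGE)
                        return NOTIFY_OK;

                /* Drivers that need per-CPU updates iterate the policy's CPUs. */
                for_each_cpu(cpu, freqs->policy->cpus)
                        pr_debug("cpu%d: %u kHz -> %u kHz\n",
                                 cpu, freqs->old, freqs->new);

                return NOTIFY_OK;
        }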
  4. 01 May 2019 (8 commits)
  5. 19 April 2019 (2 commits)
    • KVM: lapic: Allow user to disable adaptive tuning of timer advancement · c3941d9e
      Authored by Sean Christopherson
      The introduction of adaptive tuning of lapic timer advancement did not
      allow for the scenario where userspace would want to disable adaptive
      tuning but still employ timer advancement, e.g. for testing purposes or
      to handle a use case where adaptive tuning is unable to settle on a
      suitable time.  This is especially pertinent now that KVM places a hard
      threshold on the maximum advancement time.
      
      Rework the timer semantics to accept signed values, with a value of '-1'
      being interpreted as "use adaptive tuning with KVM's internal default",
      and any other value being used as an explicit advancement time, e.g. a
      time of '0' effectively disables advancement.
      
      Note, this does not completely restore the original behavior of
      lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
      which is necessary to support autotuning, userspace could adjust
      lapic_timer_advance_ns for a *running* vCPU.  With per-vCPU tracking, the
      module params are snapshotted at vCPU creation, i.e. applying a new
      advancement effectively requires restarting a VM.
      
      Dynamically updating a running vCPU is possible, e.g. a helper could be
      added to retrieve the desired delay, choosing between the global module
      param and the per-vCPU value depending on whether or not auto-tuning is
      (globally) enabled, but introduces a great deal of complexity.  The
      wrapper itself is not complex, but understanding and documenting the
      effects of dynamically toggling auto-tuning and/or adjusting the timer
      advancement is nigh impossible since the behavior would be dependent on
      KVM's implementation as well as compiler optimizations.  In other words,
      providing stable behavior would require extremely careful consideration
      now and in the future.
      
      Given that the expected use of a manually-tuned timer advancement is to
      "tune once, run many", use the vastly simpler approach of recognizing
      changes to the module params only when creating a new vCPU.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c3941d9e
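      A minimal sketch of the new parameter semantics (illustrative example_*
      code; only the -1 vs. explicit-value split follows the commit text): the
      value is sampled once, at vCPU creation.

        #include <linux/types.h>

        /* Illustrative: -1 selects adaptive tuning, anything else is used as-is. */
        static int lapic_timer_advance_ns = -1;

        static void example_init_timer_advance(u32 *advance_ns, bool *adaptive)
        {
                if (lapic_timer_advance_ns == -1) {
                        *adaptive = true;
                        *advance_ns = 1000;   /* illustrative internal default */
                } else {
                        *adaptive = false;    /* 0 disables advancement entirely */
                        *advance_ns = lapic_timer_advance_ns;
                }
        }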
    • KVM: lapic: Track lapic timer advance per vCPU · 39497d76
      Authored by Sean Christopherson
      Automatically adjusting the globally-shared timer advancement could
      corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
      the advancement value.  That could be partially fixed by using a local
      variable for the arithmetic, but it would still be susceptible to a
      race when setting timer_advance_adjust_done.
      
      And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
      correct calibration for a given vCPU may not apply to all vCPUs.
      
      Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
      effectively violated when finding a stable advancement takes an extended
      amount of time.
      
      Opportunistically change the definition of lapic_timer_advance_ns to
      a u32 so that it matches the style of struct kvm_timer.  Explicitly
      pass the param to kvm_create_lapic() so that it doesn't have to be
      exposed to lapic.c, thus reducing the probability of unintentionally
      using the global value instead of the per-vCPU value.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      39497d76
  6. 16 April 2019 (9 commits)
    • kvm: move KVM_CAP_NR_MEMSLOTS to common code · c110ae57
      Authored by Paolo Bonzini
      All architectures except MIPS were defining it in the same way,
      and memory slots are handled entirely by common code so there
      is no point in keeping the definition per-architecture.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c110ae57
    • KVM: x86: Inject #GP if guest attempts to set unsupported EFER bits · 0a629563
      Authored by Sean Christopherson
      EFER.LME and EFER.NX are considered reserved if their respective feature
      bits are not advertised to the guest.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0a629563
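      A minimal sketch of the described check (illustrative helper;
      guest_cpuid_has() and the EFER_*/X86_FEATURE_* names are existing KVM/x86
      identifiers): a bit whose CPUID feature is not advertised is treated as
      reserved, and the caller injects #GP.

        /* Illustrative: EFER bits are reserved if the matching CPUID bit is hidden. */
        static bool example_efer_allowed(struct kvm_vcpu *vcpu, u64 efer)
        {
                if ((efer & EFER_LME) && !guest_cpuid_has(vcpu, X86_FEATURE_LM))
                        return false;
                if ((efer & EFER_NX) && !guest_cpuid_has(vcpu, X86_FEATURE_NX))
                        return false;
                return true;    /* caller injects #GP when this returns false */
        }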
    • KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes · 11988499
      Authored by Sean Christopherson
      KVM allows userspace to violate consistency checks related to the
      guest's CPUID model to some degree.  Generally speaking, userspace has
      carte blanche when it comes to guest state so long as jamming invalid
      state won't negatively affect the host.
      
      Currently this seems to be a non-issue as most of the interesting
      EFER checks are missing, e.g. NX and LME, but those will be added
      shortly.  Proactively exempt userspace from the CPUID checks so as not
      to break userspace.
      
      Note, the efer_reserved_bits check still applies to userspace writes as
      that mask reflects the host's capabilities, e.g. KVM shouldn't allow a
      guest to run with NX=1 if it has been disabled in the host.
      
      Fixes: d8017474 ("KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      11988499
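      A minimal sketch of the exemption, reusing the example_efer_allowed()
      helper sketched above (illustrative; msr_info->host_initiated is the
      existing struct msr_data field, the rest is made up): the host-capability
      mask is always enforced, the guest-CPUID check only for guest-initiated
      writes.

        /* Illustrative: skip guest-CPUID checks for host-initiated EFER writes. */
        static u64 efer_reserved_bits;          /* bits the host cannot support */

        static int example_set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        {
                u64 efer = msr_info->data;

                if (efer & efer_reserved_bits)
                        return 1;       /* host lacks the feature: always reject */

                if (!msr_info->host_initiated && !example_efer_allowed(vcpu, efer))
                        return 1;       /* guest write violating guest CPUID */

                return 0;               /* accept and commit the new EFER */
        }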
    • KVM: x86: fix warning Using plain integer as NULL pointer · be43c440
      Authored by Hariprasad Kelam
      Change the argument passed from 0 to NULL, which resolves the sparse warning below:
      
      arch/x86/kvm/x86.c:3096:61: warning: Using plain integer as NULL pointer
      Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      be43c440
    • KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels · b68f3cc7
      Authored by Sean Christopherson
      Invoking the 64-bit variation on a 32-bit kernel will crash the guest,
      trigger a WARN, and/or lead to a buffer overrun in the host, e.g.
      rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and
      thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64.
      
      KVM allows userspace to report long mode support via CPUID, even though
      the guest is all but guaranteed to crash if it actually tries to enable
      long mode.  But, a pure 32-bit guest that is ignorant of long mode will
      happily plod along.
      
      SMM complicates things as 64-bit CPUs use a different SMRAM save state
      area.  KVM handles this correctly for 64-bit kernels, e.g. uses the
      legacy save state map if userspace has hidden long mode from the guest,
      but doesn't fare well when userspace reports long mode support on a
      32-bit host kernel (32-bit KVM doesn't support 64-bit guests).
      
      Since the alternative is to crash the guest, e.g. by not loading state
      or explicitly requesting shutdown, unconditionally use the legacy SMRAM
      save state map for 32-bit KVM.  If a guest has managed to get far enough
      to handle SMIs when running under a weird/buggy userspace hypervisor,
      then don't deliberately crash the guest, since there are no downsides
      (from KVM's perspective) to allowing it to continue running.
      
      Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b68f3cc7
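      A minimal sketch of the selection logic described above (close to the
      commit's description but still illustrative): the 64-bit SMRAM loader is
      only reachable on CONFIG_X86_64 builds, so 32-bit KVM always takes the
      legacy map.

        /* Illustrative: 32-bit builds can never reach the 64-bit SMRAM layout. */
        static int example_rsm_load_state(struct x86_emulate_ctxt *ctxt,
                                          const char *smstate)
        {
        #ifdef CONFIG_X86_64
                if (emulator_has_longmode(ctxt))
                        return rsm_load_state_64(ctxt, smstate);
        #endif
                return rsm_load_state_32(ctxt, smstate);
        }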
    • KVM: x86: Open code kvm_set_hflags · c5833c7a
      Authored by Sean Christopherson
      Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM
      save state map, i.e. kvm_smm_changed() needs to be called after state
      has been loaded and so cannot be done automatically when setting
      hflags from RSM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5833c7a
    • KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Authored by Sean Christopherson
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed19321f
    • x86/kvm: move kvm_load/put_guest_xcr0 into atomic context · 1811d979
      Authored by WANG Chao
      Guest xcr0 can leak into the host when an MCE happens in guest mode,
      because do_machine_check() can schedule out at a few places.
      
      For example:
      
      kvm_load_guest_xcr0
      ...
      kvm_x86_ops->run(vcpu) {
        vmx_vcpu_run
          vmx_complete_atomic_exit
            kvm_machine_check
              do_machine_check
                do_memory_failure
                  memory_failure
                    lock_page
      
      In this case, host_xcr0 is 0x2ff and the guest vcpu xcr0 is 0xff. After the
      schedule-out, the host CPU still has the guest xcr0 (0xff) loaded.
      
      In __switch_to {
           switch_fpu_finish
             copy_kernel_to_fpregs
               XRSTORS
      
      If, for any bit i, XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
      generate #GP (in this case, bit 9). Then ex_handler_fprestore kicks in
      and tries to reinitialize the FPU by restoring the init FPU state. Same
      story as the last #GP, except we get a DOUBLE FAULT this time.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: WANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1811d979
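      A minimal sketch of the intended ordering (illustrative only; the actual
      change moves the calls into the vendor vcpu_run paths): guest xcr0 is
      live only inside the IRQs-off, preemption-off window, so nothing that can
      schedule runs on top of it.

        /* Illustrative ordering, not the actual vcpu_enter_guest()/vmx_vcpu_run(). */
        static void example_run_guest(struct kvm_vcpu *vcpu)
        {
                preempt_disable();
                local_irq_disable();

                kvm_load_guest_xcr0(vcpu);      /* inside the atomic window */
                kvm_x86_ops->run(vcpu);         /* #MC handling may happen on exit */
                kvm_put_guest_xcr0(vcpu);       /* host xcr0 back before any schedule */

                local_irq_enable();
                preempt_enable();
        }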
    • kvm: mmu: Fix overflow on kvm mmu page limit calculation · bc8a3d89
      Authored by Ben Gardon
      KVM bases its memory usage limits on the total number of guest pages
      across all memslots. However, those limits, and the calculations to
      produce them, use 32 bit unsigned integers. This can result in overflow
      if a VM has more guest pages than can be represented by a u32. As a
      result of this overflow, KVM can use a low limit on the number of MMU
      pages it will allocate. This makes KVM unable to map all of guest memory
      at once, prompting spurious faults.
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
      	introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bc8a3d89
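      A minimal sketch of the fix (illustrative function; KVM_PERMILLE_MMU_PAGES
      and KVM_MIN_ALLOC_MMU_PAGES are existing constants): the page counts and
      the derived limit are computed in unsigned long so the arithmetic cannot
      wrap in a u32.

        /* Illustrative: do the limit arithmetic in unsigned long, not u32. */
        static unsigned long example_max_mmu_pages(unsigned long nr_guest_pages)
        {
                unsigned long nr_mmu_pages;

                nr_mmu_pages = nr_guest_pages * KVM_PERMILLE_MMU_PAGES / 1000;

                if (nr_mmu_pages < KVM_MIN_ALLOC_MMU_PAGES)
                        nr_mmu_pages = KVM_MIN_ALLOC_MMU_PAGES;

                return nr_mmu_pages;
        }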
  7. 13 April 2019 (1 commit)
    • x86/fpu: Defer FPU state load until return to userspace · 5f409e20
      Authored by Rik van Riel
      Defer loading of FPU state until return to userspace. This gives
      the kernel the potential to skip loading FPU state for tasks that
      stay in kernel mode, or for tasks that end up with repeated
      invocations of kernel_fpu_begin() & kernel_fpu_end().
      
      The fpregs_lock/unlock() section ensures that the registers remain
      unchanged. Otherwise a context switch or a bottom half could save the
      registers to its FPU context and the processor's FPU registers would
      become random if modified at the same time.
      
      KVM swaps the host/guest registers on entry/exit path. This flow has
      been kept as is. First it ensures that the registers are loaded and then
      saves the current (host) state before it loads the guest's registers. The
      swap is done at the very end with disabled interrupts so it should not
      change anymore before the guest is entered. The read/save version seems
      to be cheaper compared to memcpy() in a micro benchmark.
      
      Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy().
      For kernel threads, this flag is never cleared, which avoids saving /
      restoring the FPU state for kernel threads and during in-kernel usage of
      the FPU registers.
      
       [
         bp: Correct and update commit message and fix checkpatch warnings.
         s/register/registers/ where it is used in plural.
         minor comment corrections.
         remove unused trace_x86_fpu_activate_state() TP.
       ]
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aubrey Li <aubrey.li@intel.com>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: Nicolai Stange <nstange@suse.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yi Wang <wang.yi59@zte.com.cn>
      Link: https://lkml.kernel.org/r/20190403164156.19645-24-bigeasy@linutronix.de
      5f409e20
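      A minimal sketch of the deferral (illustrative example_* functions;
      TIF_NEED_FPU_LOAD and fpregs_lock()/fpregs_unlock() are names this series
      introduces): a context switch only marks the task, and the registers are
      actually restored on the way back to userspace.

        /* Illustrative: mark on switch-in, restore only when returning to user. */
        static void example_switch_fpu_finish(struct task_struct *next)
        {
                set_tsk_thread_flag(next, TIF_NEED_FPU_LOAD);   /* no XRSTOR here */
        }

        static void example_prepare_exit_to_usermode(void)
        {
                if (!test_thread_flag(TIF_NEED_FPU_LOAD))
                        return;

                fpregs_lock();          /* keep registers stable while restoring */
                /* XRSTOR the task's saved FPU state from memory here */
                clear_thread_flag(TIF_NEED_FPU_LOAD);
                fpregs_unlock();
        }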
  8. 11 April 2019 (1 commit)
    • x86/fpu: Use a feature number instead of mask in two more helpers · abd16d68
      Authored by Sebastian Andrzej Siewior
      After changing the argument of __raw_xsave_addr() from a mask to a
      number, Dave suggested checking whether it makes sense to do the same for
      get_xsave_addr(). As it turns out, it does.
      
      Only get_xsave_addr() needs the mask to check if the requested feature
      is part of what is supported/saved and then uses the number again. The
      shift operation is cheaper compared to fls64() (find last bit set).
      Also, the feature number uses less opcode space compared to the mask. :)
      
      Make the get_xsave_addr() argument an xfeature number instead of a mask
      and fix up its callers.
      
      Furthermore, use xfeature_nr and xfeature_mask consistently.
      
      This results in the following changes to the kvm code:
      
        feature -> xfeature_mask
        index -> xfeature_nr
      Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: kvm ML <kvm@vger.kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190403164156.19645-12-bigeasy@linutronix.de
      abd16d68
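      A minimal sketch of the new calling convention (an illustrative stand-in
      for get_xsave_addr(); the extra supported_mask parameter is not in the
      real function, and __raw_xsave_addr() is the internal helper the commit
      mentions): callers pass an XFEATURE_* number and the mask is built only
      where the supported-features check needs it.

        /* Illustrative: take the feature number, derive the mask internally. */
        void *example_get_xsave_addr(struct xregs_state *xsave, int xfeature_nr,
                                     u64 supported_mask)
        {
                if (!(BIT_ULL(xfeature_nr) & supported_mask))
                        return NULL;            /* feature not saved/supported */

                return __raw_xsave_addr(xsave, xfeature_nr);
        }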
  9. 29 March 2019 (4 commits)
    • KVM: x86: update %rip after emulating IO · 45def77e
      Authored by Sean Christopherson
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison has negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      45def77e
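      A minimal sketch of the snapshot-and-compare (illustrative;
      kvm_get_linear_rip()/kvm_is_linear_rip() are existing helpers, while the
      example_* functions and where the snapshot is stored are stand-ins): the
      instruction is skipped on completion only if %rip still matches the
      snapshot.

        /* Illustrative: cancel PIO completion if %rip moved (e.g. userspace reset). */
        static int example_complete_fast_pio(struct kvm_vcpu *vcpu)
        {
                if (!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip))
                        return 1;       /* do not skip an instruction we never ran */

                return kvm_skip_emulated_instruction(vcpu);
        }

        static void example_setup_fast_pio_exit(struct kvm_vcpu *vcpu)
        {
                vcpu->arch.pio.linear_rip = kvm_get_linear_rip(vcpu);  /* snapshot */
                vcpu->arch.complete_userspace_io = example_complete_fast_pio;
        }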
    • kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Authored by Xiaoyao Li
      MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally, even if the host
      doesn't support it, so move it from the msrs_to_save array to the
      emulated_msrs array to report to userspace that the guest supports this MSR.
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2bdb76c0
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Authored by Sean Christopherson
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0cf9135b
    • KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Authored by Wei Yang
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is
        non-zero.
      
      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.
      
      For these two reasons, we can merge the two *if* clauses and use the
      return value from kvm_mmu_calculate_mmu_pages() directly. This simplifies
      the code and also eliminates the possibility of a reader believing
      nr_mmu_pages could be zero.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4d66623c
  10. 21 February 2019 (6 commits)
    • Revert "KVM: x86: use the fast way to invalidate all pages" · 7390de1e
      Authored by Sean Christopherson
      Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
      Flushing all shadow entries is only done during VM teardown, i.e.
      kvm_arch_flush_shadow_all() is only called when the associated MM struct
      is being released or when the VM instance is being freed.
      
      Although the performance of teardown itself isn't critical, KVM should
      still voluntarily schedule to play nice with the rest of the kernel;
      but that can be done without the fast invalidate mechanism in a future
      patch.
      
      This reverts commit 6ca18b69.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7390de1e
    • Revert "KVM: MMU: reclaim the zapped-obsolete page first" · 52d5dedc
      Authored by Sean Christopherson
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      reverting all patches from the series that introduced the mechanism[1].
      
      This reverts commit 365c8868.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      52d5dedc
    • KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Authored by Sean Christopherson
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15248258
    • kvm: x86: Add memcg accounting to KVM allocations · 254272ce
      Authored by Ben Gardon
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      
      There remain a few allocations which should be charged to the VM's
      cgroup but are not. In x86, they include:
      	vcpu->arch.pio_data
      These allocations are unaccounted in this patch because they are mapped
      to userspace, and accounting them to a cgroup causes problems. This
      should be addressed in a future patch.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      254272ce
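      A minimal sketch of the accounting change (illustrative call site;
      GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT): VM-lifetime
      allocations get charged to the allocating process's cgroup so the OOM
      killer can see them.

        #include <linux/slab.h>

        /* Illustrative: a VM-lifetime allocation charged to the process's cgroup. */
        static void *example_vm_alloc(size_t size)
        {
                return kzalloc(size, GFP_KERNEL_ACCOUNT);
        }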
    • KVM: x86: Sync the pending Posted-Interrupts · 81b01667
      Authored by Luwei Kang
      Some Posted-Interrupts from passthrough devices may be lost or
      overwritten when the vCPU is in runnable state.
      
      The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
      be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state
      but not running on a physical CPU). If a posted interrupt arrives at this
      time, the IRQ remapping facility will set the bit in PIR (Posted
      Interrupt Requests) without setting ON (Outstanding Notification).
      So this interrupt cannot be synced to the APIC virtualization registers
      and will not be handled by the guest because ON is zero.
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      [Eliminate the pi_clear_sn fast path. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      81b01667
    • KVM: x86: cull apicv code when userspace irqchip is requested · f7589cca
      Authored by Paolo Bonzini
      Currently apicv_active can be true even if in-kernel LAPIC
      emulation is disabled.  Avoid this by properly initializing
      it in kvm_arch_vcpu_init, and then do not do anything to
      deactivate APICv when it is actually not used.
      
      (Currently APICv is only deactivated by SynIC code that in turn
      is only reachable when in-kernel LAPIC is in use.  However, it is
      cleaner if kvm_vcpu_deactivate_apicv avoids relying on this.)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7589cca
  11. 14 February 2019 (1 commit)
    • KVM: x86: Recompute PID.ON when clearing PID.SN · c112b5f5
      Authored by Luwei Kang
      Some Posted-Interrupts from passthrough devices may be lost or
      overwritten when the vCPU is in runnable state.
      
      The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will
      be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but
      not running on physical CPU). If a posted interrupt comes at this time,
      the irq remapping facility will set the bit of PIR (Posted Interrupt
      Requests) but not ON (Outstanding Notification).  Then, the interrupt
      will not be seen by KVM, which always expects PID.ON=1 if PID.PIR=1
      as documented in the Intel processor SDM but not in the VT-d specification.
      To fix this, restore the invariant after PID.SN is cleared.
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c112b5f5
  12. 08 February 2019 (1 commit)
    • KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222) · 353c0956
      Authored by Paolo Bonzini
      Bugzilla: 1671930
      
      Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with
      memory operand, INVEPT, INVVPID) can incorrectly inject a page fault
      when passed an operand that points to an MMIO address.  The page fault
      will use uninitialized kernel stack memory as the CR2 and error code.
      
      The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR
      exit to userspace; however, it is not an easy fix, so for now just
      ensure that the error code and CR2 are zero.
      
      Embargoed until Feb 7th 2019.
      Reported-by: Felix Wilhelm <fwilhelm@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      353c0956
  13. 26 January 2019 (3 commits)
    • KVM: x86: Mark expected switch fall-throughs · b2869f28
      Authored by Gustavo A. R. Silva
      In preparation for enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing effort to enable -Wimplicit-fallthrough.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b2869f28
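      A minimal sketch of the annotation pattern (illustrative switch): at
      warning level 3, a plain "fall through" comment is enough to silence
      -Wimplicit-fallthrough for an intentional fall-through.

        /* Illustrative: an intentional fall-through, marked so GCC stays quiet. */
        static int example_dispatch(int op, int val)
        {
                int ret = 0;

                switch (op) {
                case 1:
                        ret += val;
                        /* fall through */
                case 2:
                        ret += val;
                        break;
                default:
                        break;
                }
                return ret;
        }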
    • KVM: x86: Fix PV IPIs for 32-bit KVM host · 1ed199a4
      Authored by Sean Christopherson
      The recognition of the KVM_HC_SEND_IPI hypercall was unintentionally
      wrapped in "#ifdef CONFIG_X86_64", causing 32-bit KVM hosts to reject
      any and all PV IPI requests despite advertising the feature.  This
      results in all KVM paravirtualized guests hanging during SMP boot due
      to IPIs never being delivered.
      
      Fixes: 4180bf1b ("KVM: X86: Implement "send IPI" hypercall")
      Cc: stable@vger.kernel.org
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1ed199a4
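      A minimal sketch of the corrected dispatch (illustrative; the hypercall
      numbers are real, the handler bodies are placeholders): KVM_HC_SEND_IPI
      sits outside the CONFIG_X86_64-only block, so 32-bit hosts can service it.

        /* Illustrative: only genuinely 64-bit-only hypercalls stay in the #ifdef. */
        static long example_hypercall(struct kvm_vcpu *vcpu, unsigned long nr)
        {
                long ret = -KVM_ENOSYS;

                switch (nr) {
        #ifdef CONFIG_X86_64
                case KVM_HC_CLOCK_PAIRING:
                        ret = 0;        /* 64-bit only: handle clock pairing here */
                        break;
        #endif
                case KVM_HC_SEND_IPI:   /* reachable on 32-bit hosts as well */
                        ret = 0;        /* deliver the PV IPI here */
                        break;
                }
                return ret;
        }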
    • KVM: x86: Fix single-step debugging · 5cc244a2
      Authored by Alexander Popov
      The single-step debugging of KVM guests on x86 is broken: if we run the
      gdb 'stepi' command at the breakpoint when the guest interrupts are
      enabled, RIP always jumps to native_apic_mem_write(). Then other
      nasty effects follow.
      
      Long investigation showed that on Jun 7, 2017 the
      commit c8401dda ("KVM: x86: fix singlestepping over syscall")
      introduced the kvm_run.debug corruption: kvm_vcpu_do_singlestep() can
      be called without X86_EFLAGS_TF set.
      
      Let's fix it. Please consider that for -stable.
      Signed-off-by: Alexander Popov <alex.popov@linux.com>
      Cc: stable@vger.kernel.org
      Fixes: c8401dda ("KVM: x86: fix singlestepping over syscall")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5cc244a2