1. 05 Jun, 2019 (1 commit)
  2. 28 May, 2019 (1 commit)
    • KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · a86cb413
      Thomas Huth authored
      KVM_CAP_MAX_VCPU_ID currently always reports KVM_MAX_VCPU_ID on all
      architectures. However, on s390x, the number of usable CPUs is
      determined at runtime: it depends on the features of the machine the
      code is running on. Since we use the vcpu_id as an index into the SCA
      structures that are defined by the hardware (see e.g. the
      sca_add_vcpu() function), the hardware limits not only the number of
      CPUs but also the range of IDs that we can use.
      Thus KVM_CAP_MAX_VCPU_ID must be determined at runtime on s390x, too.
      So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
      code into the architecture-specific code, and on s390x we have to
      return the same value here as for KVM_CAP_MAX_VCPUS.
      This problem has been discovered with the kvm_create_max_vcpus selftest.
      With this change applied, the selftest now passes on s390x, too.
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Message-Id: <20190523164309.13345-9-thuth@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
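
      A minimal sketch of the shape of this fix; the helper name
      kvm_s390_max_vcpus() is hypothetical, standing in for however s390x
      derives the SCA-based limit at runtime:

        int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
        {
        	int r;

        	switch (ext) {
        	case KVM_CAP_MAX_VCPUS:
        	case KVM_CAP_MAX_VCPU_ID:
        		/*
        		 * vcpu_id indexes the hardware-defined SCA, so the CPU
        		 * count and the usable ID range share one runtime limit.
        		 */
        		r = kvm_s390_max_vcpus();
        		break;
        	default:
        		r = 0;
        		break;
        	}
        	return r;
        }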
  3. 25 May, 2019 (15 commits)
    • KVM: x86: fix return value for reserved EFER · 66f61c92
      Paolo Bonzini authored
      Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for
      host-initiated writes", 2019-04-02) introduced a "return false" in a
      function returning int; moreover, set_efer() has a "nonzero on error"
      convention, so it should return 1.
      Reported-by: Pavel Machek <pavel@denx.de>
      Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
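
      A minimal sketch of the bug: in a function returning int with a
      nonzero-on-error convention, "return false" is 0, i.e. success:

        static int set_efer(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        {
        	u64 efer = msr_info->data;

        	if (efer & efer_reserved_bits)
        		return 1;	/* was "return false", which reported success */

        	/* ... remaining checks and the actual EFER update ... */
        	return 0;
        }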
    • KVM: x86/pmu: do not mask the value that is written to fixed PMUs · 2924b521
      Paolo Bonzini authored
      According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
      each MSR may be written with any value, and the high-order 8 bits are
      sign-extended according to the value of bit 31", but the fixed counters
      in real hardware are limited to the width of the fixed counters ("bits
      beyond the width of the fixed-function counter are reserved and must be
      written as zeros").  Fix KVM to do the same.
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
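
      A hedged sketch of the write paths the commit distinguishes, modeled
      on intel_pmu_set_msr(); details may differ from the actual patch:

        static int intel_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        {
        	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
        	struct kvm_pmc *pmc;
        	u32 msr = msr_info->index;
        	u64 data = msr_info->data;

        	if ((pmc = get_gp_pmc(pmu, msr, MSR_IA32_PERFCTR0))) {
        		if (msr_info->host_initiated)
        			pmc->counter = data;
        		else
        			pmc->counter = (s32)data; /* sign-extend bit 31 per SDM */
        		return 0;
        	} else if ((pmc = get_fixed_pmc(pmu, msr))) {
        		pmc->counter = data; /* width masking is applied on reads */
        		return 0;
        	}
        	/* ... other MSRs ... */
        	return 1;
        }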
    • KVM: x86/pmu: mask the result of rdpmc according to the width of the counters · 0e6f467e
      Paolo Bonzini authored
      This patch simplifies the changes in the next one by applying the
      counter-width masking to both RDPMC and RDMSR reads.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
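
      The masking might look like the following sketch: a pmc_bitmask()
      helper truncates raw counter values to the counter's width on reads:

        static inline u64 pmc_bitmask(struct kvm_pmc *pmc)
        {
        	struct kvm_pmu *pmu = pmc_to_pmu(pmc);

        	return pmu->counter_bitmask[pmc->type];
        }

        static inline u64 pmc_read_counter(struct kvm_pmc *pmc)
        {
        	u64 counter = pmc->counter;

        	/* ... add the running perf_event delta if one is active ... */
        	return counter & pmc_bitmask(pmc);
        }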
    • x86/kvm/pmu: Set AMD's virt PMU version to 1 · a80c4ec1
      Borislav Petkov authored
      After commit:
      
        672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
      
      my AMD guests started #GPing like this:
      
        general protection fault: 0000 [#1] PREEMPT SMP
        CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
        RIP: 0010:x86_perf_event_update+0x3b/0xa0
      
      with Code: pointing to RDPMC. It is RDPMC because the guest has the
      hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
      perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:
      
        if (!pmu->version)
        	return 1;
      
      which the above commit added. Since AMD's PMU leaves the version at 0,
      that causes the #GP injection into the guest.
      
      Set pmu->version arbitrarily to 1 and move it above the non-applicable
      struct kvm_pmu members.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Mihai Carabas <mihai.carabas@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: stable@vger.kernel.org
      Fixes: 672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
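
      A sketch of the fix in AMD's PMU initialization; surrounding
      assignments are abbreviated and may differ from the actual patch:

        static void amd_pmu_refresh(struct kvm_vcpu *vcpu)
        {
        	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

        	pmu->nr_arch_gp_counters = AMD64_NUM_COUNTERS;
        	pmu->counter_bitmask[KVM_PMC_GP] = ((u64)1 << 48) - 1;
        	/* AMD reports no PMU version; pick 1 so rdpmc is not rejected. */
        	pmu->version = 1;
        	/* ... non-applicable kvm_pmu members left at zero ... */
        }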
    • KVM: x86: do not spam dmesg with VMCS/VMCB dumps · 6f2f8453
      Paolo Bonzini authored
      Userspace can easily set up invalid processor state in such a way that
      dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
      using a module parameter.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
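
      A hedged sketch of the gate, matching the pattern described above:

        static bool __read_mostly dump_invalid_vmcs;
        module_param(dump_invalid_vmcs, bool, 0644);

        static void dump_vmcs(void)
        {
        	if (!dump_invalid_vmcs) {
        		pr_warn_ratelimited("set kvm_intel.dump_invalid_vmcs=1 "
        				    "to dump internal KVM state.\n");
        		return;
        	}
        	/* ... the existing VMCS dump ... */
        }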
    • kvm: Check irqchip mode before assign irqfd · 654f1f13
      Peter Xu authored
      When assigning a kvm irqfd we didn't check the irqchip mode, so we
      allowed KVM_IRQFD to succeed with all the irqchip modes.  However it
      does not make much sense to create an irqfd without the kernel chips.
      Let's provide an arch-dependent helper to check whether a specific
      irqfd is allowed by the arch.  At least for x86, it should make sense
      to check:
      
      - when irqchip mode is NONE, all irqfds should be disallowed, and,
      
      - when irqchip mode is SPLIT, irqfds that are with resamplefd should
        be disallowed.
      
      In either case, previously we would silently ignore the irq or the
      irq ack event if the irqchip mode was incorrect.  However that can
      cause mysterious guest behaviors and can be hard to triage.  Let's
      fail KVM_IRQFD even earlier to detect these incorrect configurations.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Radim Krčmář <rkrcmar@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
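
      For x86 the helper can reduce to the two rules above; this sketch
      mirrors that logic:

        bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args)
        {
        	bool resample = args->flags & KVM_IRQFD_FLAG_RESAMPLE;

        	/*
        	 * NONE: no in-kernel interrupt delivery, so no irqfds at all.
        	 * SPLIT: irq-ack notifications live in userspace, so irqfds
        	 * with a resamplefd cannot work.
        	 */
        	return resample ? irqchip_kernel(kvm) : irqchip_in_kernel(kvm);
        }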
    • kvm: svm/avic: fix off-by-one in checking host APIC ID · c9bcd3e3
      Suravee Suthikulpanit authored
      The current logic does not allow a VCPU to be loaded onto a CPU with
      APIC ID 255. This should be allowed, since the host physical APIC ID
      field in the AVIC Physical APIC table entry is an 8-bit value, and
      APIC ID 255 is valid in systems with x2APIC enabled. Instead, do not
      allow a VCPU load if the host APIC ID cannot be represented by an
      8-bit value.
      
      Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
      instead of AVIC_MAX_PHYSICAL_ID_COUNT.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
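
      A sketch of the corrected check in the vcpu-load path: reject only
      host APIC IDs that do not fit the 8-bit field, rather than IDs >= 255:

        /* In avic_vcpu_load() (sketch): */
        int h_physical_id = kvm_cpu_get_apicid(cpu);

        /*
        * Off by one: "h_physical_id >= AVIC_MAX_PHYSICAL_ID_COUNT"
        * rejected APIC ID 255, which is a valid 8-bit value.
        */
        if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
        	return;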
    • KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace · 16ba3ab4
      Wanpeng Li authored
      Expose the per-vCPU timer_advance_ns to userspace, so that it can
      query the auto-adjusted value.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow · 0e6edceb
      Wanpeng Li authored
      After commit c3941d9e (KVM: lapic: Allow user to disable adaptive
      tuning of timer advancement), '-1' enables adaptive tuning starting
      from the default advancement of 1000ns. However, the module parameter
      should be exposed as an int rather than a uint, so that the value
      does not overflow.
      
      Before patch:
      
      /sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295
      
      After patch:
      
      /sys/module/kvm/parameters/lapic_timer_advance_ns:-1
      
      Fixes: c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
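
      The fix is essentially a type change on the module parameter; a
      sketch:

        /* Signed: -1 reads back as -1 instead of 4294967295. */
        static int __read_mostly lapic_timer_advance_ns = -1;
        module_param(lapic_timer_advance_ns, int, 0644);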
    • kvm: vmx: Fix -Wmissing-prototypes warnings · 4d259965
      Yi Wang authored
      We get a warning when building the kernel with W=1:
      arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes]
       void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
      
      Add the missing declaration to fix this.
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
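
      The fix is a declaration visible at the definition site, taken from
      the warning's own signature (the header placement is illustrative):

        /* In vmx.h, so vmx.c sees a previous prototype: */
        void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp);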
    • KVM: nVMX: Fix using __this_cpu_read() in preemptible context · 541e886f
      Wanpeng Li authored
       BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
        caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
        CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G           OE     5.1.0-rc4+ #1
        Call Trace:
         dump_stack+0x67/0x95
         __this_cpu_preempt_check+0xd2/0xe0
         nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
         nested_vmx_run+0xda/0x2b0 [kvm_intel]
         handle_vmlaunch+0x13/0x20 [kvm_intel]
         vmx_handle_exit+0xbd/0x660 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
         kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
         do_vfs_ioctl+0xa5/0x6e0
         ksys_ioctl+0x6d/0x80
         __x64_sys_ioctl+0x1a/0x20
         do_syscall_64+0x6f/0x6c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Accessing a per-cpu variable requires preemption to be disabled; this
      patch extends the preemption-disabled region to cover the
      __this_cpu_read().
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 52017608 ("KVM: nVMX: add option to perform early consistency checks via H/W")
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
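
      The general pattern of the fix, sketched with illustrative names; a
      per-cpu read must not race with migration to another CPU:

        static u64 read_percpu_safely(void)
        {
        	u64 val;

        	preempt_disable();	/* pin this task to the current CPU */
        	val = __this_cpu_read(some_percpu_var);	/* name illustrative */
        	/* ... any work that must observe the same CPU's state ... */
        	preempt_enable();
        	return val;
        }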
    • kvm: x86: Include CPUID leaf 0x8000001e in kvm's supported CPUID · 382409b4
      Jim Mattson authored
      KVM now supports extended CPUID functions through 0x8000001f.  CPUID
      leaf 0x8000001e is AMD's Processor Topology Information leaf. This
      contains similar information to CPUID leaf 0xb (Intel's Extended
      Topology Enumeration leaf), and should be included in the output of
      KVM_GET_SUPPORTED_CPUID, even though userspace is likely to override
      some of this information based upon the configuration of the
      particular VM.
      
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Fixes: 8765d753 ("KVM: X86: Extend CPUID range to include new leaf")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Include multiple indices with CPUID leaf 0x8000001d · 32a243df
      Jim Mattson authored
      Per the APM, "CPUID Fn8000_001D_E[D,C,B,A]X reports cache topology
      information for the cache enumerated by the value passed to the
      instruction in ECX, referred to as Cache n in the following
      description. To gather information for all cache levels, software must
      repeatedly execute CPUID with 8000_001Dh in EAX and ECX set to
      increasing values beginning with 0 until a value of 00h is returned in
      the field CacheType (EAX[4:0]) indicating no more cache descriptions
      are available for this processor."
      
      The termination condition is the same as leaf 4, so we can reuse that
      code block for leaf 0x8000001d.
      
      Fixes: 8765d753 ("KVM: X86: Extend CPUID range to include new leaf")
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
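
      The shared enumeration loop might look like this sketch; the
      per-entry fill helper's name is illustrative of KVM's CPUID code:

        case 4:
        case 0x8000001d: {
        	int i, cache_type;

        	/* Emit subleaves until CacheType (EAX[4:0]) reads zero. */
        	for (i = 1; ; ++i) {
        		cache_type = entry[i - 1].eax & 0x1f;
        		if (!cache_type)
        			break;
        		do_cpuid_1_ent(&entry[i], function, i);
        	}
        	break;
        }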
    • KVM: nVMX: Clear nested_run_pending if setting nested state fails · 21be4ca1
      Sean Christopherson authored
      VMX's nested_run_pending flag is subtly consumed when stuffing state
      to enter guest mode, i.e. it needs to be set before KVM knows whether
      setting guest state will succeed.  If setting guest state fails, clear
      the flag, as a nested run is obviously not pending.
      Reported-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
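
      A sketch of the error path in the set-nested-state flow:

        vmx->nested.nested_run_pending =
        	!!(kvm_state->flags & KVM_STATE_NESTED_RUN_PENDING);

        ret = nested_vmx_enter_non_root_mode(vcpu, false);
        if (ret) {
        	/* No nested run is pending if we failed to enter guest mode. */
        	vmx->nested.nested_run_pending = 0;
        	return ret;
        }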
    • KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATE · db80927e
      Paolo Bonzini authored
      The offset for reading the shadow VMCS is sizeof(*kvm_state)+VMCS12_SIZE,
      so the correct size must be that plus sizeof(*vmcs12).  This could lead
      to KVM reading garbage data from userspace and not reporting an error,
      but is otherwise not sensitive.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
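
      The corrected bound, sketched against the layout described above
      (header, a VMCS12_SIZE slot for vmcs12, then the shadow vmcs12);
      variable naming is illustrative:

        if (kvm_state->size <
            sizeof(*kvm_state) + VMCS12_SIZE + sizeof(*vmcs12))
        	return -EINVAL;

        if (copy_from_user(get_shadow_vmcs12(vcpu),
        		   user_kvm_nested_state->data + VMCS12_SIZE,
        		   sizeof(*vmcs12)))
        	return -EFAULT;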
  4. 16 May, 2019 (3 commits)
    • Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" · f93f7ede
      Sean Christopherson authored
      The RDPMC-exiting control is dependent on the existence of the RDPMC
      instruction itself, i.e. is not tied to the "Architectural Performance
      Monitoring" feature.  For all intents and purposes, the control exists
      on all CPUs with VMX support, since RDPMC also exists on all CPUs with
      VMX support.  Per Intel's SDM:
      
        The RDPMC instruction was introduced into the IA-32 Architecture in
        the Pentium Pro processor and the Pentium processor with MMX technology.
        The earlier Pentium processors have performance-monitoring counters, but
        they must be read with the RDMSR instruction.
      
      Because RDPMC-exiting always exists, KVM requires the control and
      refuses to load if it's not available.  As a result, hiding the PMU
      from a guest breaks nested virtualization if the guest attempts to
      use KVM.
      
      While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit
      check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization
      for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE
      checks, but before the counter referenced in ECX is checked for validity.
      
      In other words, the original KVM behavior of injecting a #GP was correct,
      and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP
      when the unit test guest (L3 in this case) executes RDPMC without
      RDPMC-exiting set in the unit test host (L2).
      
      This reverts commit e51bfdb6.
      
      Fixes: e51bfdb6 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU")
      Reported-by: David Hill <hilld@binarystorm.net>
      Cc: Saar Amar <saaramar@microsoft.com>
      Cc: Mihai Carabas <mihai.carabas@oracle.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Fix L1TF mitigation for shadow MMU · 61455bf2
      Kai Huang authored
      Currently KVM sets the 5 most significant bits of the physical address
      width reported by CPUID (boot_cpu_data.x86_phys_bits) in nonpresent or
      reserved SPTEs to mitigate L1TF attacks from the guest when using the
      shadow MMU. However, on some Intel CPUs the physical address width of
      the internal cache is greater than the physical address width reported
      by CPUID.
      
      Use the kernel's existing boot_cpu_data.x86_cache_bits to determine the
      five most significant bits. Doing so improves KVM's L1TF mitigation in
      the unlikely scenario that system RAM overlaps the high order bits of
      the "real" physical address space as reported by CPUID. This aligns with
      the kernel's warnings regarding L1TF mitigation, e.g. in the above
      scenario the kernel won't warn the user about lack of L1TF mitigation
      if x86_cache_bits is greater than x86_phys_bits.
      
      Also initialize shadow_nonpresent_or_rsvd_mask explicitly to make it
      consistent with other 'shadow_{xxx}_mask', and opportunistically add a
      WARN once if KVM's L1TF mitigation cannot be applied on a system that
      is marked as being susceptible to L1TF.
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
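
      A hedged sketch of the mask setup, including the WARN the message
      mentions:

        shadow_nonpresent_or_rsvd_mask = 0;
        if (boot_cpu_data.x86_cache_bits <
            52 - shadow_nonpresent_or_rsvd_mask_len) {
        	shadow_nonpresent_or_rsvd_mask =
        		rsvd_bits(boot_cpu_data.x86_cache_bits -
        			  shadow_nonpresent_or_rsvd_mask_len,
        			  boot_cpu_data.x86_cache_bits - 1);
        } else {
        	/* No bits left above the cache width: mitigation impossible. */
        	WARN_ON_ONCE(boot_cpu_has_bug(X86_BUG_L1TF));
        }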
    • KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible · d69129b4
      Sean Christopherson authored
      If L1 is using an MSR bitmap, unconditionally merge the MSR bitmaps from
      L0 and L1 for MSR_{KERNEL,}_{FS,GS}_BASE.  KVM unconditionally exposes
      these MSRs to L1.  If KVM is also running in L1, then it's highly likely L1 is
      also exposing the MSRs to L2, i.e. KVM doesn't need to intercept L2
      accesses.
      
      Based on code from Jintack Lim.
      
      Cc: Jintack Lim <jintack@xxxxxxxxxxxxxxx>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
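
      The merge might be expressed as in this sketch, clearing the
      intercepts in the vmcs02 bitmap when L1 uses an MSR bitmap:

        nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
        				     MSR_FS_BASE, MSR_TYPE_RW);
        nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
        				     MSR_GS_BASE, MSR_TYPE_RW);
        nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
        				     MSR_KERNEL_GS_BASE, MSR_TYPE_RW);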
  5. 15 May, 2019 (1 commit)
    • mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
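
      The signature change, sketched; existing callers map write=1 to
      FOLL_WRITE and write=0 to 0:

        /* Before: */
        int get_user_pages_fast(unsigned long start, int nr_pages,
        			int write, struct page **pages);

        /* After: */
        int get_user_pages_fast(unsigned long start, int nr_pages,
        			unsigned int gup_flags, struct page **pages);

        /* A converted call site: */
        ret = get_user_pages_fast(addr, 1, write ? FOLL_WRITE : 0, &page);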
  6. 10 May, 2019 (1 commit)
    • cpufreq: Call transition notifier only once for each policy · df24014a
      Viresh Kumar authored
      Currently, the notifiers are called once for each CPU of the
      policy->cpus cpumask. It would be more efficient if the notifier could
      be called only once, with all the relevant information provided to it.
      Out of the 23 drivers that register for the transition notifiers
      today, only 4 of them do per-cpu updates, and the callback for the
      rest can be called only once for the policy without any impact.
      
      This would also avoid multiple function calls to the notifier
      callbacks and reduce repeated iterations of the notifier core's code
      (which takes locks as well).
      
      This patch adds a pointer to the cpufreq policy to struct
      cpufreq_freqs, so the notifier callback has all the information
      available to it with a single call. The drivers that perform per-cpu
      updates are updated to use the cpufreq policy. The freqs->cpu field is
      now redundant and is removed.
      
      Acked-by: David S. Miller <davem@davemloft.net> (sparc)
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
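
      A sketch of the struct change and of how a per-cpu-updating notifier
      iterates the policy; update_per_cpu_data() is illustrative:

        struct cpufreq_freqs {
        	struct cpufreq_policy *policy;	/* new: replaces ->cpu */
        	unsigned int old;		/* kHz */
        	unsigned int new;		/* kHz */
        	u8 flags;
        };

        static int transition_notifier(struct notifier_block *nb,
        			       unsigned long state, void *data)
        {
        	struct cpufreq_freqs *freqs = data;
        	int cpu;

        	for_each_cpu(cpu, freqs->policy->cpus)
        		update_per_cpu_data(cpu, freqs->new);
        	return NOTIFY_OK;
        }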
  7. 08 May, 2019 (2 commits)
  8. 01 May, 2019 (16 commits)