1. 17 December 2016 (1 commit)
  2. 15 December 2016 (2 commits)
  3. 13 December 2016 (2 commits)
    • x86/ldt: use vfree_atomic() to free ldt entries · 8d5341a6
      Authored by Andrey Ryabinin
      vfree() is going to use sleeping lock.  free_ldt_struct() may be called
      with disabled preemption, therefore we must use vfree_atomic() here.
      
      E.g. call trace:
      	vfree()
      	free_ldt_struct()
      	destroy_context_ldt()
      	__mmdrop()
      	finish_task_switch()
      	schedule_tail()
      	ret_from_fork()
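
      A minimal sketch of the resulting free path (abridged from
      arch/x86/kernel/ldt.c; details may differ from the actual patch):

        static void free_ldt_struct(struct ldt_struct *ldt)
        {
                if (likely(!ldt))
                        return;

                paravirt_free_ldt(ldt->entries, ldt->size);
                if (ldt->size * LDT_ENTRY_SIZE > PAGE_SIZE)
                        vfree_atomic(ldt->entries);     /* was: vfree(ldt->entries) */
                else
                        free_page((unsigned long)ldt->entries);
                kfree(ldt);
        }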
      
      Link: http://lkml.kernel.org/r/1479474236-4139-7-git-send-email-hch@lst.de
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Jisheng Zhang <jszhang@marvell.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: John Dias <joaodias@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d5341a6
    • mm: remove x86-only restriction of movable_node · 39fa104d
      Authored by Reza Arbab
      In commit c5320926 ("mem-hotplug: introduce movable_node boot
      option"), the memblock allocation direction is changed to bottom-up and
      then back to top-down like this:
      
      1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node().
      2. memblock_set_bottom_up(false), called by x86's numa_init().
      
      Even though (1) occurs in generic mm code, it is wrapped by #ifdef
      CONFIG_MOVABLE_NODE, which depends on X86_64.
      
      This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches,
      things will be unbalanced.  (1) will happen for them, but (2) will not.
      
      This toggle was added in the first place because x86 has a delay between
      adding memblocks and marking them as hotpluggable.  Since other arches
      do this marking either immediately or not at all, they do not require
      the bottom-up toggle.
      
      So, resolve things by moving (1) from cmdline_parse_movable_node() to
      x86's setup_arch(), immediately after the movable_node parameter has
      been parsed.
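
      A sketch of the moved call site (as described above; it assumes the
      movable_node_is_enabled() helper from memory_hotplug.h):

        /* in x86's setup_arch(), right after early parameter parsing */
        #ifdef CONFIG_MOVABLE_NODE
                /*
                 * Allocate memblock bottom-up, i.e. just above the kernel, so
                 * that early allocations tend to stay out of the (usually high)
                 * hotpluggable range until it has been marked as such.
                 */
                if (movable_node_is_enabled())
                        memblock_set_bottom_up(true);
        #endif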
      
      Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alistair Popple <apopple@au1.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Frank Rowand <frowand.list@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Stewart Smith <stewart@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39fa104d
  4. 11 December 2016 (3 commits)
    • x86/paravirt: Fix bool return type for PVOP_CALL() · 11f254db
      Authored by Peter Zijlstra
      Commit:
      
        3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      
      introduced a paravirt op with bool return type [*]
      
      It turns out that the PVOP_CALL*() macros miscompile when rettype is
      bool. Code that looked like:
      
         83 ef 01                sub    $0x1,%edi
         ff 15 32 a0 d8 00       callq  *0xd8a032(%rip)        # ffffffff81e28120 <pv_lock_ops+0x20>
         84 c0                   test   %al,%al
      
      ended up looking like so after PVOP_CALL1() was applied:
      
         83 ef 01                sub    $0x1,%edi
         48 63 ff                movslq %edi,%rdi
         ff 14 25 20 81 e2 81    callq  *0xffffffff81e28120
         48 85 c0                test   %rax,%rax
      
      Note how it tests the whole of %rax, even though a typical bool return
      function only sets %al, like:
      
        0f 95 c0                setne  %al
        c3                      retq
      
      This is because ____PVOP_CALL() does:
      
      		__ret = (rettype)__eax;
      
      and while regular integer type casts truncate the result, a cast to
      bool tests for any !0 value. Fix this by explicitly truncating to
      sizeof(rettype) before casting.
      
      [*] The actual bug should've been exposed in commit:
            446f3dc8 ("locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests")
          but that didn't properly implement the paravirt call.
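
      A sketch of the masking approach (abridged from the paravirt macros;
      the surrounding macro plumbing is omitted):

        #define PVOP_RETMASK(rettype)                                   \
                ({      unsigned long __mask = ~0UL;                    \
                        switch (sizeof(rettype)) {                      \
                        case 1: __mask =       0xffUL; break;           \
                        case 2: __mask =     0xffffUL; break;           \
                        case 4: __mask = 0xffffffffUL; break;           \
                        default: break;                                 \
                        }                                               \
                        __mask;                                         \
                })

        /* in ____PVOP_CALL(), instead of __ret = (rettype)__eax; */
        __ret = (rettype)(__eax & PVOP_RETMASK(rettype));
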
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      Link: http://lkml.kernel.org/r/20161208154349.346057680@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      11f254db
    • x86/paravirt: Fix native_patch() · 45dbea5f
      Authored by Peter Zijlstra
      While chasing a regression I noticed we potentially patch the wrong
      code in native_patch().
      
      If we do not select the native code sequence, we must use the default
      patcher instead of falling through the switch case.
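
      A sketch of the intended control flow (symbol names follow the 64-bit
      lock-ops patching code and are abridged):

        case PARAVIRT_PATCH(pv_lock_ops.vcpu_is_preempted):
                if (pv_is_native_vcpu_is_preempted()) {
                        start = start_pv_lock_ops_vcpu_is_preempted;
                        end   = end_pv_lock_ops_vcpu_is_preempted;
                        goto patch_site;
                }
                goto patch_default;     /* previously fell through */
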
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel test robot <xiaolong.ye@intel.com>
      Fixes: 3cded417 ("x86/paravirt: Optimize native pv_lock_ops.vcpu_is_preempted()")
      Link: http://lkml.kernel.org/r/20161208154349.270616999@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      45dbea5f
    • perf/x86: Fix exclusion of BTS and LBR for Goldmont · b0c1ef52
      Authored by Andi Kleen
      An earlier patch allowed enabling PT and LBR at the same
      time on Goldmont. However it also allowed enabling BTS and LBR
      at the same time, which is still not supported. Fix this by
      bypassing the check only for PT.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alexander.shishkin@intel.com
      Cc: kan.liang@intel.com
      Cc: <stable@vger.kernel.org>
      Fixes: ccbebba4 ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it")
      Link: http://lkml.kernel.org/r/20161209001417.4713-1-andi@firstfloor.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b0c1ef52
  5. 10 December 2016 (7 commits)
    • x86/ldt: Make all size computations unsigned · 990e9dc3
      Authored by Thomas Gleixner
      ldt->size can never be negative. The helper functions take 'unsigned int'
      arguments which are assigned from ldt->size. The related user space
      user_desc struct member entry_number is unsigned as well.
      
      But ldt->size itself and a few local variables which are related to
      ldt->size are type 'int' which makes no sense whatsoever and results in
      typecasts which make the eyes bleed.
      
      Clean it up and convert everything which is related to ldt->size to
      unsigned int.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      990e9dc3
    • x86/ldt: Make a size argument unsigned · 296dc580
      Authored by Dan Carpenter
      My static checker complains that we put an upper bound on the "size"
      argument but not a lower bound.  The checker is not smart enough to know
      the possible ranges of "old_mm->context.ldt->size" from
      init_new_context_ldt() so it thinks maybe it could be negative.
      
      Let's make it unsigned to silence the warning and future proof the code
      a bit.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: kernel-janitors@vger.kernel.org
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20161208105602.GA11382@elgon.mountain
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      296dc580
    • x86: Remove empty idle.h header · 34bc3560
      Authored by Thomas Gleixner
      One include less is always a good thing(tm). Good riddance.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-6-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      34bc3560
    • x86/amd: Simplify AMD E400 aware idle routine · 07c94a38
      Authored by Borislav Petkov
      Reorganize the E400 detection now that we have everything in place:
      switch the CPUs to broadcast mode after the LAPIC has been initialized
      and remove the facilities that were used previously on the idle path.
      
      Unfortunately static_cpu_has_bug() cannot be used in the E400 idle routine
      because alternatives have been applied when the actual detection happens,
      so the static switching does not take effect and the test will stay
      false. Use boot_cpu_has_bug() instead which is definitely an improvement
      over the RDMSR and the cpumask handling.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-5-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      07c94a38
    • x86/amd: Check for the C1E bug post ACPI subsystem init · e7ff3a47
      Authored by Thomas Gleixner
      AMD CPUs affected by the E400 erratum suffer from the issue that the
      local APIC timer stops when the CPU goes into C1E. Unfortunately there
      is no way to detect the affected CPUs on early boot. It's only possible
      to determine the range of possibly affected CPUs from the family/model
      range.
      
      The actual decision whether to enter C1E and thus cause the bug is done
      by the firmware and we need to detect that case late, after ACPI has
      been initialized.
      
      The current solution is to check in the idle routine whether the CPU is
      affected by reading the MSR_K8_INT_PENDING_MSG MSR and checking for the
      K8_INTP_C1E_ACTIVE_MASK bits. If one of the bits is set then the CPU is
      affected and the system is switched into forced broadcast mode.
      
      This is inefficient: on non-affected CPUs every entry to idle does
      the extra RDMSR.
      
      After doing some research it turns out that the bits are visible on the
      boot CPU right after the ACPI subsystem is initialized in the early
      boot process. So instead of polling for the bits in the idle loop, add
      a detection function after acpi_subsystem_init() and check for the MSR
      bits. If set, then the X86_BUG_AMD_APIC_C1E is set on the boot CPU and
      the TSC is marked unstable when X86_FEATURE_NONSTOP_TSC is not set as it
      will stop in C1E state as well.
      
      The switch to broadcast mode cannot be done at this point because the
      boot CPU still uses HPET as a clockevent device and the local APIC timer
      is not yet calibrated and installed. The switch to broadcast mode on the
      affected CPUs needs to be done when the local APIC timer is actually set
      up.
      
      This allows cleaning up the amd_e400_idle() function in the next step.
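
      A sketch of the late detection step (the hook name and the message text
      are assumptions based on the description above; the MSR and mask names
      are the ones already mentioned):

        void __init arch_post_acpi_subsys_init(void)
        {
                u32 lo, hi;

                if (!boot_cpu_has_bug(X86_BUG_AMD_E400))
                        return;

                /* The bits are only valid after the ACPI subsystem is up. */
                rdmsr(MSR_K8_INT_PENDING_MSG, lo, hi);
                if (!(lo & K8_INTP_C1E_ACTIVE_MASK))
                        return;

                set_cpu_bug(&boot_cpu_data, X86_BUG_AMD_APIC_C1E);

                if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
                        mark_tsc_unstable("TSC halts in AMD C1E");
                pr_info("System has AMD C1E enabled\n");
        }
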
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-4-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e7ff3a47
    • x86/bugs: Separate AMD E400 erratum and C1E bug · 3344ed30
      Authored by Thomas Gleixner
      The workaround for the AMD Erratum E400 (Local APIC timer stops in C1E
      state) is a two step process:
      
       - Selection of the E400 aware idle routine
      
       - Detection whether the platform is affected
      
      The idle routine selection happens for possibly affected CPUs depending on
      family/model/stepping information. This range of CPUs is not necessarily
      affected, as the decision whether to enable the C1E feature is made by the
      firmware. Unfortunately there is no way to query this at early boot.
      
      The current implementation polls an MSR in the E400 aware idle routine to
      detect whether the CPU is affected. This is inefficient on non-affected
      CPUs because every idle entry has to do the MSR read.
      
      There is a better way to detect this before going idle for the first time,
      which requires separating the bug flags:
      
        X86_BUG_AMD_E400      - Selects the E400 aware idle routine and
                                enables the detection

        X86_BUG_AMD_APIC_C1E  - Set when the platform is affected by E400
      
      Replace the current X86_BUG_AMD_APIC_C1E usage by the new X86_BUG_AMD_E400
      bug bit to select the idle routine which currently does an unconditional
      detection poll. X86_BUG_AMD_APIC_C1E is going to be used in later patches
      to remove the MSR polling and simplify the handling of this misfeature.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-3-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3344ed30
    • x86/cpufeature: Provide helper to set bugs bits · a588b983
      Authored by Borislav Petkov
      Will be used in a later patch to set bug bits for bugs which need late
      detection.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Link: http://lkml.kernel.org/r/20161209182912.2726-2-bp@alien8.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a588b983
  6. 09 December 2016 (3 commits)
  7. 08 December 2016 (17 commits)
    • KVM: x86: Handle the kthread worker using the new API · 36da91bd
      Authored by Petr Mladek
      Use the new API to create and destroy the "kvm-pit" kthread
      worker. The API hides some implementation details.
      
      In particular, kthread_create_worker() allocates and initializes
      struct kthread_worker. It runs the kthread the right way
      and stores task_struct into the worker structure.
      
      kthread_destroy_worker() flushes all pending works, stops
      the kthread and frees the structure.
      
      This patch does not change the existing behavior except for
      dynamically allocating struct kthread_worker and storing
      only the pointer of this structure.
      
      It is compile-tested only because I did not find an easy
      way to run the code. Well, it should be pretty safe
      given the nature of the change.
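
      A sketch of the API usage (abridged; field names per the KVM PIT code):

        /* creation, e.g. in kvm_create_pit() */
        pit->worker = kthread_create_worker(0, "kvm-pit/%d", pid_nr(pid));
        if (IS_ERR(pit->worker))
                goto fail;                      /* error handling abridged */
        kthread_init_work(&pit->expired, pit_do_work);

        /* teardown, e.g. in kvm_free_pit(): flushes pending work, stops the
         * kthread and frees the dynamically allocated worker */
        kthread_destroy_worker(pit->worker);
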
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      36da91bd
    • KVM: nVMX: invvpid handling improvements · 16c2aec6
      Authored by Jan Dakinevich
       - Expose all invalidation types to the L1
      
       - Reject the invvpid instruction if L1 passed a zero vpid value to
         single-context invalidations
      Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      16c2aec6
    • KVM: nVMX: check host CR3 on vmentry and vmexit · 1dc35dac
      Authored by Ladi Prosek
      This commit adds missing host CR3 checks. Before entering guest mode, the value
      of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
      called to set the new CR3 value and check and load PDPTRs.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      1dc35dac
    • KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
      Authored by Ladi Prosek
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      9ed38ffa
    • KVM: nVMX: propagate errors from prepare_vmcs02 · ee146c1c
      Authored by Ladi Prosek
      It is possible that prepare_vmcs02 fails to load the guest state. This
      patch adds the proper error handling for such a case. L1 will receive
      an INVALID_STATE vmexit with the appropriate exit qualification if it
      happens.
      
      A failure to set guest CR3 is the only error propagated from prepare_vmcs02
      at the moment.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      ee146c1c
    • KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT · 7ca29de2
      Authored by Ladi Prosek
      KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
      PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
      PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
      (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
      dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
      related issues:
      
      1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
      overwritten in ept_load_pdptrs because the registers are believed to have
      been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
      running with PAE enabled but PDPTRs have been set up by L1.
      
      2) When L2 is about to enable paging and loads its CR3, we, again, attempt
      to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
      that this will succeed (it's just a CR3 load, paging is not enabled yet) and
      if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
      then lost and L2 crashes right after it enables paging.
      
      This patch replaces the kvm_set_cr3 call with a simple register write if PAE
      and EPT are both on. CR3 is not to be interpreted in this case.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      7ca29de2
    • KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry · 5a6a9748
      Authored by David Matlack
      vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
      VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
      is more faithful to VMCS12.
      
      This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
      1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
      "IA-32e mode guest" would silently be disabled by KVM.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      5a6a9748
    • KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID · 8322ebbb
      Authored by David Matlack
      MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
      be 1 during VMX operation. Since the set of allowed-1 bits is the same
      in and out of VMX operation, we can generate these MSRs entirely from
      the guest's CPUID. This lets userspace avoid having to save/restore
      these MSRs.
      
      This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
      by default. This is saner than the current default of -1ull, which
      includes bits that the host CPU does not support.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      8322ebbb
    • KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation · 3899152c
      Authored by David Matlack
      KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
      all CR0 and CR4 bits are allowed to be 1 during VMX operation.
      
      This does not match real hardware, which disallows the high 32 bits of
      CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
      which are defined in the SDM but missing according to CPUID). A guest
      can induce a VM-entry failure by setting these bits in GUEST_CR0 and
      GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
      valid.
      
      Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
      checks on these registers do not verify must-be-0 bits. Fix these checks
      to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
      
      This patch should introduce no change in behavior in KVM, since these
      MSRs are still -1ULL.
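
      A sketch of the check (the nested-state field names and how the failure
      is reported are assumptions):

        static bool fixed_bits_valid(u64 val, u64 fixed0, u64 fixed1)
        {
                /* every FIXED0 bit must be 1; every bit clear in FIXED1 must be 0 */
                return ((val & fixed0) == fixed0) && ((val & fixed1) == val);
        }

        /* e.g. for the guest CR0 supplied in vmcs12: */
        if (!fixed_bits_valid(vmcs12->guest_cr0,
                              vmx->nested.nested_vmx_cr0_fixed0,
                              vmx->nested.nested_vmx_cr0_fixed1))
                return -EINVAL;
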
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      3899152c
    • KVM: nVMX: support restore of VMX capability MSRs · 62cc6b9d
      Authored by David Matlack
      The VMX capability MSRs advertise the set of features the KVM virtual
      CPU can support. This set of features varies across different host CPUs
      and KVM versions. This patch aims to address both sources of
      differences, allowing VMs to be migrated across CPUs and KVM versions
      without guest-visible changes to these MSRs. Note that cross-KVM-
      version migration is only supported from this point forward.
      
      When the VMX capability MSRs are restored, they are audited to check
      that the set of features advertised are a subset of what KVM and the
      CPU support.
      
      Since the VMX capability MSRs are read-only, they do not need to be on
      the default MSR save/restore lists. The userspace hypervisor can set
      the values of these MSRs or read them from KVM at VCPU creation time,
      and restore the same value after every save/restore.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      62cc6b9d
    • KVM: nVMX: generate non-true VMX MSRs based on true versions · 0115f9cb
      Authored by David Matlack
      The "non-true" VMX capability MSRs can be generated from their "true"
      counterparts, by OR-ing the default1 bits. The default1 bits are fixed
      and defined in the SDM.
      
      Since we can generate the non-true VMX MSRs from the true versions,
      there's no need to store both in struct nested_vmx. This also lets
      userspace avoid having to restore the non-true MSRs.
      
      Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
      we simply need to set all the default1 bits in the true MSRs (such that
      the true MSRs and the generated non-true MSRs are equal).
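
      A hypothetical helper illustrating the generation (the default1 value
      stands for the SDM-defined always-on bits of the low, allowed-0 half):

        static u64 nested_vmx_nontrue_msr(u64 true_msr, u32 default1)
        {
                u32 low  = (u32)true_msr | default1;    /* allowed-0 settings */
                u32 high = (u32)(true_msr >> 32);       /* allowed-1 settings */

                return ((u64)high << 32) | low;
        }
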
      Signed-off-by: David Matlack <dmatlack@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      0115f9cb
    • KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs. · ea07e42d
      Authored by Kyle Huey
      The trap flag stays set until software clears it.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      ea07e42d
    • KVM: x86: Add kvm_skip_emulated_instruction and use it. · 6affcbed
      Authored by Kyle Huey
      kvm_skip_emulated_instruction calls both
      kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
      skipping the emulated instruction and generating a trap if necessary
      (a sketch of the combined helper follows the list below).
      
      Replacing skip_emulated_instruction calls with
      kvm_skip_emulated_instruction is straightforward, except for:
      
      - ICEBP, which is already inside a trap, so avoid triggering another trap.
      - Instructions that can trigger exits to userspace, such as the IO insns,
        MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
        KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
        IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
        take precedence. The singlestep will be triggered again on the next
        instruction, which is the current behavior.
      - Task switch instructions which would require additional handling (e.g.
        the task switch bit) and are instead left alone.
      - Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
        which do not trigger singlestep traps as mentioned previously.
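
      A sketch of the combined helper (return value: nonzero to keep running
      the guest, zero to exit to userspace; exact details may differ):

        int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
        {
                unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
                int r = EMULATE_DONE;

                kvm_x86_ops->skip_emulated_instruction(vcpu);
                kvm_vcpu_check_singlestep(vcpu, rflags, &r);

                return r == EMULATE_DONE;
        }
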
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      6affcbed
    • KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12 · eb277562
      Authored by Kyle Huey
      We can't return both the pass/fail boolean for the vmcs and the upcoming
      continue/exit-to-userspace boolean for skip_emulated_instruction out of
      nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.
      
      Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
      they advance the IP to the following instruction, not when they a) succeed,
      b) fail MSR validation or c) throw an exception. Add a separate call to
      skip_emulated_instruction that will later not be converted to the variant
      that checks the singlestep flag.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      eb277562
    • KVM: VMX: Reorder some skip_emulated_instruction calls · 09ca3f20
      Authored by Kyle Huey
      The functions being moved ahead of skip_emulated_instruction here don't
      need updated IPs, and skipping the emulated instruction at the end will
      make it easier to return its value.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      09ca3f20
    • KVM: x86: Add a return value to kvm_emulate_cpuid · 6a908b62
      Authored by Kyle Huey
      Once skipping the emulated instruction can potentially trigger an exit to
      userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
      propagate a return value.
      Signed-off-by: Kyle Huey <khuey@kylehuey.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      6a908b62
    • xen/pci: Bubble up error and fix description. · 577f79e4
      Authored by Konrad Rzeszutek Wilk
      The function is never called under PV guests, and only shows up
      when MSI (or MSI-X) cannot be allocated. Convert the message
      to include the error value.
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
      577f79e4
  8. 06 December 2016 (4 commits)
    • x86/uaccess, sched/preempt: Verify access_ok() context · 7c478895
      Authored by Peter Zijlstra
      I recently encountered wreckage because access_ok() was used where it
      should not be. Add an explicit WARN when access_ok() is used wrongly.
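
      A sketch of the check (essentially the shape of the x86 access_ok()
      definition; the config gate is an assumption):

        #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
        # define WARN_ON_IN_IRQ()       WARN_ON_ONCE(!in_task())
        #else
        # define WARN_ON_IN_IRQ()
        #endif

        #define access_ok(type, addr, size)                             \
        ({                                                              \
                WARN_ON_IN_IRQ();                                       \
                likely(!__range_not_ok(addr, size, user_addr_max()));   \
        })
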
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7c478895
    • perf/x86: Fix full width counter, counter overflow · 7f612a7f
      Authored by Peter Zijlstra (Intel)
      Lukasz reported that perf stat counter overflow handling is broken on KNL/SLM.
      
      Both these parts have full_width_write set, and that does indeed have
      a problem. In order to deal with counter wrap, we must sample the
      counter at at least half the counter period (see also the sampling
      theorem) such that we can unambiguously reconstruct the count.
      
      However commit:
      
        069e0c3c ("perf/x86/intel: Support full width counting")
      
      sets the sampling interval to the full period, not half.
      
      Fixing that exposes another issue, in that we must not sign extend the
      delta value when we shift it right; the counter cannot have
      decremented after all.
      
      With both these issues fixed, counter overflow functions correctly
      again.
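
      A sketch of the two changes described above (abridged; declarations
      simplified):

        /* 1) sample within half the counter width so a wrap is unambiguous */
        if (x86_pmu.intel_cap.full_width_write)
                x86_pmu.max_period = x86_pmu.cntval_mask >> 1;  /* was: cntval_mask */

        /* 2) the counter only counts up, so the delta must not be sign
         *    extended when shifted back down */
        u64 delta;                                              /* was: s64 */

        delta = (new_raw_count << shift) - (prev_raw_count << shift);
        delta >>= shift;                                        /* logical shift now */
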
      Reported-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
      Tested-by: Liang, Kan <kan.liang@intel.com>
      Tested-by: Odzioba, Lukasz <lukasz.odzioba@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: stable@vger.kernel.org
      Fixes: 069e0c3c ("perf/x86/intel: Support full width counting")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7f612a7f
    • perf/x86/intel: Enable C-state residency events for Knights Mill · 1dba23b1
      Authored by Piotr Luc
      Knights Mill is close enough to Knights Landing that it reuses the
      C-state residency support of the latter.
      Signed-off-by: Piotr Luc <piotr.luc@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Link: http://lkml.kernel.org/r/20161201000853.18260-1-piotr.luc@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1dba23b1
    • x86/suspend: fix false positive KASAN warning on suspend/resume · b53f40db
      Authored by Josh Poimboeuf
      Resuming from a suspend operation is showing a KASAN false positive
      warning:
      
        BUG: KASAN: stack-out-of-bounds in unwind_get_return_address+0x11d/0x130 at addr ffff8803867d7878
        Read of size 8 by task pm-suspend/7774
        page:ffffea000e19f5c0 count:0 mapcount:0 mapping:          (null) index:0x0
        flags: 0x2ffff0000000000()
        page dumped because: kasan: bad access detected
        CPU: 0 PID: 7774 Comm: pm-suspend Tainted: G    B           4.9.0-rc7+ #8
        Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016
        Call Trace:
          dump_stack+0x63/0x82
          kasan_report_error+0x4b4/0x4e0
          ? acpi_hw_read_port+0xd0/0x1ea
          ? kfree_const+0x22/0x30
          ? acpi_hw_validate_io_request+0x1a6/0x1a6
          __asan_report_load8_noabort+0x61/0x70
          ? unwind_get_return_address+0x11d/0x130
          unwind_get_return_address+0x11d/0x130
          ? unwind_next_frame+0x97/0xf0
          __save_stack_trace+0x92/0x100
          save_stack_trace+0x1b/0x20
          save_stack+0x46/0xd0
          ? save_stack_trace+0x1b/0x20
          ? save_stack+0x46/0xd0
          ? kasan_kmalloc+0xad/0xe0
          ? kasan_slab_alloc+0x12/0x20
          ? acpi_hw_read+0x2b6/0x3aa
          ? acpi_hw_validate_register+0x20b/0x20b
          ? acpi_hw_write_port+0x72/0xc7
          ? acpi_hw_write+0x11f/0x15f
          ? acpi_hw_read_multiple+0x19f/0x19f
          ? memcpy+0x45/0x50
          ? acpi_hw_write_port+0x72/0xc7
          ? acpi_hw_write+0x11f/0x15f
          ? acpi_hw_read_multiple+0x19f/0x19f
          ? kasan_unpoison_shadow+0x36/0x50
          kasan_kmalloc+0xad/0xe0
          kasan_slab_alloc+0x12/0x20
          kmem_cache_alloc_trace+0xbc/0x1e0
          ? acpi_get_sleep_type_data+0x9a/0x578
          acpi_get_sleep_type_data+0x9a/0x578
          acpi_hw_legacy_wake_prep+0x88/0x22c
          ? acpi_hw_legacy_sleep+0x3c7/0x3c7
          ? acpi_write_bit_register+0x28d/0x2d3
          ? acpi_read_bit_register+0x19b/0x19b
          acpi_hw_sleep_dispatch+0xb5/0xba
          acpi_leave_sleep_state_prep+0x17/0x19
          acpi_suspend_enter+0x154/0x1e0
          ? trace_suspend_resume+0xe8/0xe8
          suspend_devices_and_enter+0xb09/0xdb0
          ? printk+0xa8/0xd8
          ? arch_suspend_enable_irqs+0x20/0x20
          ? try_to_freeze_tasks+0x295/0x600
          pm_suspend+0x6c9/0x780
          ? finish_wait+0x1f0/0x1f0
          ? suspend_devices_and_enter+0xdb0/0xdb0
          state_store+0xa2/0x120
          ? kobj_attr_show+0x60/0x60
          kobj_attr_store+0x36/0x70
          sysfs_kf_write+0x131/0x200
          kernfs_fop_write+0x295/0x3f0
          __vfs_write+0xef/0x760
          ? handle_mm_fault+0x1346/0x35e0
          ? do_iter_readv_writev+0x660/0x660
          ? __pmd_alloc+0x310/0x310
          ? do_lock_file_wait+0x1e0/0x1e0
          ? apparmor_file_permission+0x18/0x20
          ? security_file_permission+0x73/0x1c0
          ? rw_verify_area+0xbd/0x2b0
          vfs_write+0x149/0x4a0
          SyS_write+0xd9/0x1c0
          ? SyS_read+0x1c0/0x1c0
          entry_SYSCALL_64_fastpath+0x1e/0xad
        Memory state around the buggy address:
         ffff8803867d7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         ffff8803867d7780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        >ffff8803867d7800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f4
                                                                        ^
         ffff8803867d7880: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
         ffff8803867d7900: 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f3 f3 f3 f3 00
      
      KASAN instrumentation poisons the stack when entering a function and
      unpoisons it when exiting the function.  However, in the suspend path,
      some functions never return, so their stack never gets unpoisoned,
      resulting in stale KASAN shadow data which can cause later false
      positive warnings like the one above.
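
      A sketch of the idea behind the fix (placement is illustrative; it
      assumes the kasan_unpoison_task_stack_below() helper is available on the
      resume path):

        static void unpoison_stale_stack(void)
        {
                char watermark;

                /*
                 * Execution resumed on the old task stack: everything below
                 * the current frame is dead, so drop its stale shadow poison.
                 */
                kasan_unpoison_task_stack_below(&watermark);
        }
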
      Reported-by: Scott Bauer <scott.bauer@intel.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      b53f40db
  9. 02 December 2016 (1 commit)