1. 03 Nov 2016 (1 commit)
    • KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK · ea26e4ec
      Committed by Paolo Bonzini
      Since commit a545ab6a ("kvm: x86: add tsc_offset field to struct
      kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
      cached and need not be fished out of the VMCS or VMCB.  This means
      that we can implement adjust_tsc_offset_guest and read_l1_tsc
      entirely in generic code.  The simplification is particularly
      significant for VMX code, where vmx->nested.vmcs01_tsc_offset
      was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
      the vmcs01_tsc_offset can be dropped completely.
      
      More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
      which, after commit 108b249c ("KVM: x86: introduce get_kvmclock_ns",
      2016-09-01) called read_l1_tsc while the VMCS was not loaded.
      It thus returned bogus values on Intel CPUs.
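
      A rough sketch of what the generic helpers can look like once the offset is
      cached (simplified; kvm_scale_tsc() and kvm_vcpu_write_tsc_offset() are named
      from memory, so treat the exact helpers as illustrative):

      u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
      {
              /* no need to ask VMX/SVM code to fish the offset out of the VMCS/VMCB */
              return vcpu->arch.tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
      }

      static void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 adjustment)
      {
              /* update the cached value and push it to hardware */
              kvm_vcpu_write_tsc_offset(vcpu, vcpu->arch.tsc_offset + adjustment);
      }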
      
      Fixes: 108b249c
      Reported-by: Roman Kagan <rkagan@virtuozzo.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 28 Oct 2016 (2 commits)
  3. 27 Oct 2016 (1 commit)
  4. 20 Oct 2016 (2 commits)
    • kvm: x86: memset whole irq_eoi · 8678654e
      Committed by Jiri Slaby
      gcc 7 warns:
      arch/x86/kvm/ioapic.c: In function 'kvm_ioapic_reset':
      arch/x86/kvm/ioapic.c:597:2: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]
      
      And it is right. Clear the whole array by using the sizeof operator.
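
      The fix boils down to this pattern (sketch; field layout simplified):

      /* Buggy: the length is the number of array elements, so only the
       * first IOAPIC_NUM_PINS bytes get cleared. */
      memset(ioapic->irq_eoi, 0x00, IOAPIC_NUM_PINS);

      /* Fixed: clear the whole array by using sizeof. */
      memset(ioapic->irq_eoi, 0x00, sizeof(ioapic->irq_eoi));
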
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      [Added x86 subject tag]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm/x86: Fix unused variable warning in kvm_timer_init() · 758f588d
      Committed by Borislav Petkov
      When CONFIG_CPU_FREQ is not set, int cpu is unused and gcc rightfully
      warns about it:
      
        arch/x86/kvm/x86.c: In function ‘kvm_timer_init’:
        arch/x86/kvm/x86.c:5697:6: warning: unused variable ‘cpu’ [-Wunused-variable]
          int cpu;
              ^~~
      
      But since it is used only in the CONFIG_CPU_FREQ block, simply move it
      there, thus squashing the warning too.
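
      The pattern, sketched (not the literal diff; the loop body stands in for the
      real per-CPU cpufreq setup):

      static void kvm_timer_init_sketch(void)
      {
      #ifdef CONFIG_CPU_FREQ
              int cpu;        /* only declared when CONFIG_CPU_FREQ is set */

              for_each_online_cpu(cpu) {
                      /* per-CPU cpufreq notifier setup */
              }
      #endif
              /* common timer setup continues here */
      }
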
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  5. 12 Oct 2016 (1 commit)
    • kthread: kthread worker API cleanup · 3989144f
      Committed by Petr Mladek
      It is good practice to prefix function names with the name of the
      subsystem.
      
      The kthread worker API is a mix of classic kthreads and workqueues.  Each
      worker has a dedicated kthread.  It runs a generic function that processes
      queued work items.  It is implemented as part of the kthread subsystem.
      
      This patch renames the existing kthread worker API to use
      the corresponding name from the workqueues API prefixed by
      kthread_:
      
      __init_kthread_worker()		-> __kthread_init_worker()
      init_kthread_worker()		-> kthread_init_worker()
      init_kthread_work()		-> kthread_init_work()
      insert_kthread_work()		-> kthread_insert_work()
      queue_kthread_work()		-> kthread_queue_work()
      flush_kthread_work()		-> kthread_flush_work()
      flush_kthread_worker()		-> kthread_flush_worker()
      
      Note that the names of the DEFINE_KTHREAD_WORK*() macros stay
      as they are. It is common for the "DEFINE_" prefix to take
      precedence over subsystem names.
      
      Note that the INIT() macros and the init() functions use different
      naming schemes. There is no perfect solution; there are several
      reasons for the chosen one:
      
        + "init" in the function names stands for the verb "initialize"
          aka "initialize worker". While "INIT" in the macro names
          stands for the noun "INITIALIZER" aka "worker initializer".
      
        + INIT() macros are used only in DEFINE() macros
      
        + init() functions are used close to the other kthread()
          functions. It looks much better if all the functions
          use the same scheme.
      
        + There will also be kthread_destroy_worker(), which will
          be used close to kthread_cancel_work(). It is related
          to the init() function. Again, it looks better if all
          functions use the same naming scheme.
      
        + There are several precedents for such init() function
          names, e.g. amd_iommu_init_device(), free_area_init_node(),
          jump_label_init_type(), regmap_init_mmio_clk(),
      
        + It is not an argument but it was inconsistent even before.
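
      A hedged usage sketch of the renamed API (error handling trimmed; the my_*
      names are just for illustration, and the pre-rename code would have called
      init_kthread_worker(), queue_kthread_work(), and so on):

      static struct kthread_worker my_worker;
      static struct kthread_work my_work;

      static void my_work_fn(struct kthread_work *work)
      {
              /* process one queued work item */
      }

      static int my_setup(void)
      {
              struct task_struct *task;

              kthread_init_worker(&my_worker);          /* was init_kthread_worker() */
              task = kthread_run(kthread_worker_fn, &my_worker, "my-worker");
              if (IS_ERR(task))
                      return PTR_ERR(task);

              kthread_init_work(&my_work, my_work_fn);  /* was init_kthread_work()   */
              kthread_queue_work(&my_worker, &my_work); /* was queue_kthread_work()  */
              kthread_flush_work(&my_work);             /* was flush_kthread_work()  */
              return 0;
      }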
      
      [arnd@arndb.de: fix linux-next merge conflict]
       Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 23 Sep 2016 (3 commits)
    • KVM: nVMX: Fix the NMI IDT-vectoring handling · c5a6d5f7
      Committed by Wanpeng Li
      Run kvm-unit-tests/eventinj.flat in L1:
      
      Sending NMI to self
      After NMI to self
      FAIL: NMI
      
      This scenario tests whether the VMM handles NMI IDT-vectoring info correctly.
      
      At the beginning, L2 writes to the LAPIC to send itself an NMI; the EPT page tables
      on both L1 and L0 are empty, so:
      
      - L2's memory access generates an EPT violation, which is intercepted by L0.
      
        The EPT violation vmexit occurs during delivery of this NMI, and the NMI info is
        recorded in vmcs02's IDT-vectoring info.
      
      - L0 walks L1's EPT12, sees that the mapping is invalid, and injects the EPT violation into L1.
      
        The vmcs02's IDT-vectoring info is reflected to vmcs12's IDT-vectoring info since
        it is a nested vmexit.
      
      - L1 receives the EPT violation, then fixes its EPT12.
      - L1 executes VMRESUME to resume L2, which generates a vmexit and causes L1 to exit to L0.
      - L0 emulates the VMRESUME issued by L1, then returns to L2.
      
        L0 merges the requirements from vmcs12's IDT-vectoring info and injects it into L2
        through vmcs02.
      
      - L2 re-executes the faulting instruction and causes an EPT violation again.
      - Since L1's EPT12 is now valid, L0 can fix its EPT02.
      - L0 resumes L2.
      
        The EPT violation vmexit again occurs during delivery of this NMI, and the NMI info
        is recorded in vmcs02's IDT-vectoring info. L0 should inject the NMI through vmentry
        event injection, since it is caused by EPT02's EPT violation.
      
      However, vmx_inject_nmi() refuses to inject an NMI from IDT-vectoring info if the vCPU
      is in guest mode. This patch fixes it by permitting the injection when it is L0's
      responsibility to deliver the NMI from the IDT-vectoring info to L2.
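
      The gist of the change, as a sketch (nmi_from_vmcs02_idt_vectoring() is a
      hypothetical helper standing in for the real bookkeeping):

      static void vmx_inject_nmi_sketch(struct kvm_vcpu *vcpu)
      {
              /* Old behaviour: never inject while the vCPU runs L2.
               * New behaviour: still inject when the NMI comes from vmcs02's
               * IDT-vectoring info, i.e. when delivering it to L2 is L0's job. */
              if (is_guest_mode(vcpu) && !nmi_from_vmcs02_idt_vectoring(vcpu))
                      return;

              /* ... program the VM-entry interruption-information field ... */
      }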
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Bandan Das <bsd@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive · f6e90f9e
      Committed by Wanpeng Li
      I observed that kvmvapic (used to optimize flexpriority=N or AMD) is used
      to boost TPR access when testing the kvm-unit-test/eventinj.flat tpr case
      on my Haswell desktop (with flexpriority, without APICv). Commit 8d14695f
      ("x86, apicv: add virtual x2apic support") disables virtual x2apic mode
      completely without APICv, and its author also told me that Windows guests
      could not enter x2apic mode back when he developed the APICv feature
      several years ago. That is no longer true: Interrupt Remapping and a
      vIOMMU have since been added to QEMU, and Intel developers recently
      verified that Windows 8 works in x2apic mode with Interrupt Remapping
      enabled.
      
      This patch enables the TPR shadow for virtual x2apic mode to boost
      Windows guests in x2apic mode even without APICv.
      
      It passes the kvm-unit-test.
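
      Conceptually, the change amounts to something like the following when the vCPU
      switches to x2apic mode (sketch only; x2apic_mode and sec_exec_ctl are
      illustrative local names, while the SECONDARY_EXEC_* bits are the real VMX
      secondary execution controls):

      if (x2apic_mode && cpu_has_vmx_tpr_shadow()) {
              /* the TPR shadow alone is enough; full APICv is not required */
              sec_exec_ctl &= ~SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
              sec_exec_ctl |= SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
      }
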
      Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
      Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wincy Van <fanwenyi0529@gmail.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: Fix reload apic access page warning · c83b6d15
      Committed by Wanpeng Li
      WARNING: CPU: 1 PID: 4230 at kernel/sched/core.c:7564 __might_sleep+0x7e/0x80
      do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d0de7f9>] prepare_to_swait+0x39/0xa0
      CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
      Call Trace:
       dump_stack+0x99/0xd0
       __warn+0xd1/0xf0
       warn_slowpath_fmt+0x4f/0x60
       ? prepare_to_swait+0x39/0xa0
       ? prepare_to_swait+0x39/0xa0
       __might_sleep+0x7e/0x80
       __gfn_to_pfn_memslot+0x156/0x480 [kvm]
       gfn_to_pfn+0x2a/0x30 [kvm]
       gfn_to_page+0xe/0x20 [kvm]
       kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
       nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
       ? _raw_spin_unlock_irqrestore+0x36/0x80
       vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
       kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
       kvm_vcpu_check_block+0x12/0x60 [kvm]
       kvm_vcpu_block+0x94/0x4c0 [kvm]
       kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
       ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
       kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      4.8.0-rc5+ #47 Not tainted
      -------------------------------
      ./include/linux/kvm_host.h:535 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      1 lock held by qemu-system-x86/4230:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffc062975c>] vcpu_load+0x1c/0x60 [kvm]
      
      stack backtrace:
      CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
      Call Trace:
       dump_stack+0x99/0xd0
       lockdep_rcu_suspicious+0xe7/0x120
       gfn_to_memslot+0x12a/0x140 [kvm]
       gfn_to_pfn+0x12/0x30 [kvm]
       gfn_to_page+0xe/0x20 [kvm]
       kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
       nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
       ? _raw_spin_unlock_irqrestore+0x36/0x80
       vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
       kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
       kvm_vcpu_check_block+0x12/0x60 [kvm]
       kvm_vcpu_block+0x94/0x4c0 [kvm]
       kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
       ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
       kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
       ? __fget+0xfd/0x210
       ? __lock_is_held+0x54/0x70
       do_vfs_ioctl+0x96/0x6a0
       ? __fget+0x11c/0x210
       ? __fget+0x5/0x210
       SyS_ioctl+0x79/0x90
       do_syscall_64+0x81/0x220
       entry_SYSCALL64_slow_path+0x25/0x25
      
      These can be triggered by running kvm-unit-test: ./x86-run x86/vmx.flat
      
      The nested preemption timer is based on an hrtimer that is started on L2
      entry, stopped on L2 exit, and evaluated via the new check_nested_events
      hook. The current logic adds the vCPU to a simple waitqueue (TASK_INTERRUPTIBLE)
      when it needs to yield the pCPU, and it accesses memslots without holding the
      srcu read lock; both can happen on the nested preemption timer evaluation path,
      which results in the warnings above.
      
      This patch fixes it by using a request bit to reload the APIC access page
      asynchronously before vmentry, instead of reloading it directly during nested
      preemption timer evaluation. This is safe because vmcs01 is loaded and we are
      in the middle of a nested vmexit.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  7. 20 Sep 2016 (5 commits)
  8. 16 Sep 2016 (6 commits)
  9. 08 Sep 2016 (10 commits)
  10. 05 Sep 2016 (1 commit)
    • KVM: lapic: adjust preemption timer correctly when goes TSC backward · e12c8f36
      Committed by Wanpeng Li
      TSC_OFFSET is adjusted if the TSC is found to have gone backwards during vCPU load.
      The preemption timer, which relies on the guest TSC to reprogram its value, is also
      reprogrammed when the vCPU is scheduled in on a different pCPU. However, the current
      implementation reprograms the preemption timer before TSC_OFFSET has been adjusted
      to the right value, resulting in the preemption timer firing prematurely.
      
      This patch fixes it by adjusting TSC_OFFSET before reprogramming the preemption
      timer when the TSC has gone backwards.
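
      In pseudocode, only the ordering changes (the helper names below are
      hypothetical; the point is that step 1 happens before step 2):

      void vcpu_load_sketch(struct kvm_vcpu *vcpu)
      {
              if (tsc_went_backwards(vcpu))            /* hypothetical check      */
                      adjust_tsc_offset_on_load(vcpu); /* 1) fix TSC_OFFSET first */

              reprogram_preemption_timer(vcpu);        /* 2) then reprogram timer */
      }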
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 19 Aug 2016 (3 commits)
  12. 18 Aug 2016 (3 commits)
    • kvm: nVMX: fix nested tsc scaling · c95ba92a
      Committed by Peter Feiner
      When the host supports TSC scaling, L2 would use a TSC multiplier of
      0, which causes a VM entry failure. Now L2's TSC uses the same
      multiplier as L1.
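
      A hedged sketch of the fix on the vmcs02 side (field names given from memory,
      treat as illustrative):

      /* When hardware TSC scaling is available, vmcs02 must carry a valid
       * multiplier (L1's ratio) instead of the default 0 that fails VM entry. */
      if (kvm_has_tsc_control)
              vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
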
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write · dccbfcf5
      Committed by Radim Krčmář
      If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
      write with vmcs02 as the current VMCS.
      This will incorrectly apply modifications intended for vmcs01 to vmcs02,
      and L2 can use it to gain access to L0's x2APIC registers by disabling
      virtualized x2APIC while using an MSR bitmap that assumes it is enabled.
      
      Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
      current VMCS.  An alternative solution would temporarily make vmcs01 the
      current VMCS, but it requires more care.
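
      Sketch of the deferral (the change_vmcs01_virtual_x2apic_mode flag name is
      given from memory and may not match the final code exactly):

      static void vmx_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set)
      {
              if (is_guest_mode(vcpu)) {
                      /* vmcs02 is current; touching it here would leak the
                       * change into L2, so only remember it for later. */
                      to_vmx(vcpu)->nested.change_vmcs01_virtual_x2apic_mode = true;
                      return;
              }
              /* safe: vmcs01 is current, update its TPR-shadow/x2APIC controls */
      }

      /* nested_vmx_vmexit() then replays the postponed update once vmcs01 is
       * loaded again. */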
      
      Fixes: 8d14695f ("x86, apicv: add virtual x2apic support")
      Reported-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC · d048c098
      Committed by Radim Krčmář
      msr bitmap can be used to avoid a VM exit (interception) on guest MSR
      accesses.  In some configurations of VMX controls, the guest can even
      directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
      APIC ACCESSES.
      
      L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
      To do so, L1 would first trick KVM into disabling all possible interceptions
      by enabling APICv features and then would turn those features off;
      nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
      not intercept previously enabled MSRs even though they were not safe
      with the new configuration.
      
      Correctly re-enabling interceptions is not enough as a second bug would
      still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
      VMCSs, so L1 could trigger a race to get the desired combination of msr
      bitmap and VMX controls.
      
      This fix allocates a msr bitmap for every L1 VCPU, allows only safe
      x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
      have to intercept everything anyway.
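
      On the allocation side this means roughly (sketch; the nested.msr_bitmap
      field name is given from memory):

      /* one MSR bitmap page per L1 vCPU, used to build vmcs02's bitmap */
      vmx->nested.msr_bitmap = (unsigned long *)__get_free_page(GFP_KERNEL);
      if (!vmx->nested.msr_bitmap)
              goto out_msr_bitmap;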
      
      Fixes: 3af18d9c ("KVM: nVMX: Prepare for using hardware MSR bitmap")
      Reported-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
      Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  13. 10 Aug 2016 (1 commit)
    • x86: Apply more __ro_after_init and const · 404f6aac
      Committed by Kees Cook
      Guided by grsecurity's analogous __read_only markings in arch/x86,
      this applies several uses of __ro_after_init to structures that are
      only updated during __init, and const for some structures that are
      never updated.  Additionally, this extends __init markings to some functions
      that are only used during __init, and cleans up some missing C99-style
      static initializers.
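
      Illustrative examples of the two markings (the symbols below are made up):

      /* written once during __init, read-only afterwards */
      static unsigned long boot_tuning_table[16] __ro_after_init;

      /* never written at all: plain const is enough */
      static const char * const feature_names[] = {
              "fpu", "vme", "de",
      };
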
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Brown <david.brown@linaro.org>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Emese Revfy <re.emese@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-hardening@lists.openwall.com
      Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 04 Aug 2016 (1 commit)