1. 12 May 2016, 1 commit
  2. 03 May 2016, 1 commit
  3. 20 Apr 2016, 1 commit
  4. 11 Apr 2016, 1 commit
    • kvm: x86: do not leak guest xcr0 into host interrupt handlers · fc5b7f3b
      Committed by David Matlack
      An interrupt handler that uses the fpu can kill a KVM VM if it runs
      under the following conditions:
       - the guest's xcr0 register is loaded on the cpu
       - the guest's fpu context is not loaded
       - the host is using eagerfpu
      
      Note that the guest's xcr0 register and fpu context are not loaded as
      part of the atomic world switch into "guest mode". They are loaded by
      KVM while the cpu is still in "host mode".
      
      Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
      interrupt handler will look something like this:
      
      if (irq_fpu_usable()) {
              kernel_fpu_begin();
      
              [... code that uses the fpu ...]
      
              kernel_fpu_end();
      }
      
      As long as the guest's fpu is not loaded and the host is using eager
      fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
      returns true). The interrupt handler proceeds to use the fpu with
      the guest's xcr0 live.
      
      kernel_fpu_begin() saves the current fpu context. If this uses
      XSAVE[OPT], it may leave the xsave area in an undesirable state.
      According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
      if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
      xcr0[i] == 0 following an XSAVE.
      
      kernel_fpu_end() restores the fpu context. Now if any bit i in
      XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
      fault is trapped and SIGSEGV is delivered to the current process.
      
      Only pre-4.2 kernels appear to be vulnerable to this sequence of
      events. Commit 653f52c3 ("kvm,x86: load guest FPU context more eagerly")
      from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
      
      This patch fixes the bug by keeping the host's xcr0 loaded outside
      of the interrupts-disabled region where KVM switches into guest mode.
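
      The idea of the fix, as a rough standalone model rather than the
      actual KVM code (every helper below is an illustrative stand-in, not
      a kernel function): the guest's xcr0 is only live inside the window
      where interrupts are off, so kernel_fpu_begin()/XSAVE in an
      interrupt handler can never run with it loaded.

          #include <stdio.h>

          static int irqs_enabled = 1;
          static const char *live_xcr0 = "host";

          static void local_irq_disable(void) { irqs_enabled = 0; }
          static void local_irq_enable(void)  { irqs_enabled = 1; }
          static void load_guest_xcr0(void)   { live_xcr0 = "guest"; }
          static void put_guest_xcr0(void)    { live_xcr0 = "host"; }

          /* Model of the guest-entry ordering: guest xcr0 is live only
           * while interrupts are disabled. */
          static void enter_guest(void)
          {
              local_irq_disable();
              load_guest_xcr0();   /* no irq handler can see guest xcr0 */
              /* ... atomic world switch, run guest, VM exit ... */
              put_guest_xcr0();    /* host xcr0 back before irqs return */
              local_irq_enable();
          }

          int main(void)
          {
              enter_guest();
              printf("after entry/exit: xcr0=%s, irqs=%s\n",
                     live_xcr0, irqs_enabled ? "on" : "off");
              return 0;
          }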
      
      Cc: stable@vger.kernel.org
      Suggested-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: David Matlack <dmatlack@google.com>
      [Move load after goto cancel_injection. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fc5b7f3b
  5. 01 Apr 2016, 1 commit
    • KVM: x86: Inject pending interrupt even if pending nmi exist · 321c5658
      Committed by Yuki Shibuya
      In the current implementation, non-maskable interrupts (NMIs) are
      preferred to interrupts. If an NMI is pending and NMI injection is
      blocked according to nmi_allowed(), a pending interrupt is not
      injected and enable_irq_window() is not executed, even though
      interrupt injection would be allowed.
      
      In old kernels (e.g. 2.6.32), schedule() is often called in NMI
      context. In that case, interrupts are needed to reach the iret that
      ends the NMI. The flag blocking new NMIs is not cleared until the
      guest executes that iret, but interrupts remain blocked by the
      pending NMI. As a result, the iret can never be reached in the
      guest, and the guest is starved until the block is cleared by some
      other event (e.g. canceling the injection).
      
      This patch injects pending interrupts, when allowed, even if an NMI
      is blocked. In addition, if an interrupt is still pending after
      inject_pending_event(), enable_irq_window() is executed regardless
      of the NMI pending counter.
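
      As a hedged sketch of the decision described above (a simplified
      model, not KVM's actual inject_pending_event()), the old and new
      behaviour differ only when an NMI is pending but blocked:

          #include <stdbool.h>
          #include <stdio.h>

          struct vcpu_state {
              bool nmi_pending;
              bool nmi_blocked;   /* guest has not executed iret yet */
              bool irq_pending;
              bool irqs_allowed;  /* RFLAGS.IF set, no interrupt shadow */
          };

          /* Old: a pending-but-blocked NMI starves pending interrupts. */
          static const char *inject_old(const struct vcpu_state *v)
          {
              if (v->nmi_pending)
                  return v->nmi_blocked ? "nothing (starved)" : "NMI";
              if (v->irq_pending && v->irqs_allowed)
                  return "IRQ";
              return "nothing";
          }

          /* New: a blocked NMI falls through to interrupt injection. */
          static const char *inject_new(const struct vcpu_state *v)
          {
              if (v->nmi_pending && !v->nmi_blocked)
                  return "NMI";
              if (v->irq_pending && v->irqs_allowed)
                  return "IRQ";   /* otherwise request an IRQ window */
              return "nothing";
          }

          int main(void)
          {
              struct vcpu_state v = {
                  .nmi_pending = true, .nmi_blocked = true,
                  .irq_pending = true, .irqs_allowed = true,
              };
              printf("old: %s, new: %s\n", inject_old(&v), inject_new(&v));
              return 0;
          }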
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      321c5658
  6. 22 Mar 2016, 3 commits
  7. 09 Mar 2016, 1 commit
  8. 04 Mar 2016, 4 commits
  9. 03 Mar 2016, 4 commits
  10. 26 Feb 2016, 1 commit
    • KVM: x86: fix root cause for missed hardware breakpoints · 70e4da7a
      Committed by Paolo Bonzini
      Commit 172b2386 ("KVM: x86: fix missed hardware breakpoints",
      2016-02-10) worked around a case where the debug registers are not loaded
      correctly on preemption and on the first entry to KVM_RUN.
      
      However, Xiao Guangrong pointed out that the root cause must be that
      KVM_DEBUGREG_BP_ENABLED is not being set correctly.  This can indeed
      happen due to the lazy debug exit mechanism, which does not call
      kvm_update_dr7.  Fix it by replacing the existing loop (more or less
      equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr*
      functions.
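
      A rough sketch of the shape of the fix, with simplified stand-in
      types and helpers rather than KVM's real kvm_update_dr* signatures:
      after userspace restores the debug registers, every piece of derived
      state is recomputed, including the breakpoints-enabled flag that the
      old copy loop never refreshed.

          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>

          #define DR7_BP_EN_MASK 0xffULL  /* L0-G3 enable bits of DR7 */

          struct vcpu {
              uint64_t db[4];
              uint64_t dr6, dr7;
              bool bp_enabled;   /* stands in for KVM_DEBUGREG_BP_ENABLED */
          };

          static void update_dr0123(struct vcpu *v, const uint64_t db[4])
          {
              for (int i = 0; i < 4; i++)
                  v->db[i] = db[i];
          }

          static void update_dr7(struct vcpu *v, uint64_t dr7)
          {
              v->dr7 = dr7;
              /* the derived flag the old loop never recomputed */
              v->bp_enabled = (dr7 & DR7_BP_EN_MASK) != 0;
          }

          static void set_debugregs(struct vcpu *v, const uint64_t db[4],
                                    uint64_t dr6, uint64_t dr7)
          {
              update_dr0123(v, db);
              v->dr6 = dr6;
              update_dr7(v, dr7);
          }

          int main(void)
          {
              struct vcpu v = { { 0 } };
              uint64_t db[4] = { 0x401000, 0, 0, 0 };
              set_debugregs(&v, db, 0, 0x1); /* DR7 local-enable for DR0 */
              printf("bp_enabled=%d\n", v.bp_enabled);
              return 0;
          }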
      
      Cc: stable@vger.kernel.org   # 4.1+
      Fixes: 172b2386
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70e4da7a
  11. 24 Feb 2016, 2 commits
    • KVM: x86: fix missed hardware breakpoints · 172b2386
      Committed by Paolo Bonzini
      Sometimes when setting a breakpoint a process doesn't stop on it.
      This is because the debug registers are not loaded correctly on
      VCPU load.
      
      The following simple reproducer from Oleg Nesterov tries using debug
      registers in two threads.  To see the bug, run a 2-VCPU guest with
      "taskset -c 0" and run "./bp 0 1" inside the guest.
      
          #include <unistd.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/wait.h>
          #include <sys/ptrace.h>
          #include <sys/user.h>
          #include <asm/debugreg.h>
          #include <assert.h>
      
          #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
      
          unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
          {
              unsigned long dr7;
      
              dr7 = ((len | type) & 0xf)
                  << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
              if (enable)
                  dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));
      
              return dr7;
          }
      
          int write_dr(int pid, int dr, unsigned long val)
          {
              return ptrace(PTRACE_POKEUSER, pid,
                      offsetof (struct user, u_debugreg[dr]),
                      val);
          }
      
          void set_bp(pid_t pid, void *addr)
          {
              unsigned long dr7;
              assert(write_dr(pid, 0, (long)addr) == 0);
              dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
              assert(write_dr(pid, 7, dr7) == 0);
          }
      
          void *get_rip(int pid)
          {
              return (void*)ptrace(PTRACE_PEEKUSER, pid,
                      offsetof(struct user, regs.rip), 0);
          }
      
          void test(int nr)
          {
              void *bp_addr = &&label + nr, *bp_hit;
              int pid;
      
              printf("test bp %d\n", nr);
              assert(nr < 16); // see 16 asm nops below
      
              pid = fork();
              if (!pid) {
                  assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
                  kill(getpid(), SIGSTOP);
                  for (;;) {
                      label: asm (
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                      );
                  }
              }
      
              assert(pid == wait(NULL));
              set_bp(pid, bp_addr);
      
              for (;;) {
                  assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
                  assert(pid == wait(NULL));
      
                  bp_hit = get_rip(pid);
                  if (bp_hit != bp_addr)
                      fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
                          bp_hit - &&label, nr);
              }
          }
      
          int main(int argc, const char *argv[])
          {
              while (--argc) {
                  int nr = atoi(*++argv);
                  if (!fork())
                      test(nr);
              }
      
              while (wait(NULL) > 0)
                  ;
              return 0;
          }
      
      Cc: stable@vger.kernel.org
      Suggested-by: Nadav Amit <namit@cs.technion.ac.il>
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      172b2386
    • x86: Fix misspellings in comments · 6a6256f9
      Committed by Adam Buchbinder
      Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: trivial@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6a6256f9
  12. 17 Feb 2016, 4 commits
    • KVM: x86: pass kvm_get_time_scale arguments in hertz · 3ae13faa
      Committed by Paolo Bonzini
      Prepare for improving the precision in the next patch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3ae13faa
    • KVM: x86: fix missed hardware breakpoints · 4e422bdd
      Committed by Paolo Bonzini
      Sometimes when setting a breakpoint a process doesn't stop on it.
      This is because the debug registers are not loaded correctly on
      VCPU load.
      
      The following simple reproducer from Oleg Nesterov tries using debug
      registers in both the host and the guest, for example by running "./bp
      0 1" on the host and "./bp 14 15" under QEMU.
      
          #include <unistd.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/wait.h>
          #include <sys/ptrace.h>
          #include <sys/user.h>
          #include <asm/debugreg.h>
          #include <assert.h>
      
          #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
      
          unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
          {
              unsigned long dr7;
      
              dr7 = ((len | type) & 0xf)
                  << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
              if (enable)
                  dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));
      
              return dr7;
          }
      
          int write_dr(int pid, int dr, unsigned long val)
          {
              return ptrace(PTRACE_POKEUSER, pid,
                      offsetof (struct user, u_debugreg[dr]),
                      val);
          }
      
          void set_bp(pid_t pid, void *addr)
          {
              unsigned long dr7;
              assert(write_dr(pid, 0, (long)addr) == 0);
              dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
              assert(write_dr(pid, 7, dr7) == 0);
          }
      
          void *get_rip(int pid)
          {
              return (void*)ptrace(PTRACE_PEEKUSER, pid,
                      offsetof(struct user, regs.rip), 0);
          }
      
          void test(int nr)
          {
              void *bp_addr = &&label + nr, *bp_hit;
              int pid;
      
              printf("test bp %d\n", nr);
              assert(nr < 16); // see 16 asm nops below
      
              pid = fork();
              if (!pid) {
                  assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
                  kill(getpid(), SIGSTOP);
                  for (;;) {
                      label: asm (
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                      );
                  }
              }
      
              assert(pid == wait(NULL));
              set_bp(pid, bp_addr);
      
              for (;;) {
                  assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
                  assert(pid == wait(NULL));
      
                  bp_hit = get_rip(pid);
                  if (bp_hit != bp_addr)
                      fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
                          bp_hit - &&label, nr);
              }
          }
      
          int main(int argc, const char *argv[])
          {
              while (--argc) {
                  int nr = atoi(*++argv);
                  if (!fork())
                      test(nr);
              }
      
              while (wait(NULL) > 0)
                  ;
              return 0;
          }
      
      Cc: stable@vger.kernel.org
      Suggested-by: Nadav Amit <namit@cs.technion.ac.il>
      Reported-by: Andrey Wagin <avagin@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4e422bdd
    • KVM: x86: rewrite handling of scaled TSC for kvmclock · 78db6a50
      Committed by Paolo Bonzini
      This is the same as before:
      
          kvm_scale_tsc(tgt_tsc_khz)
              = tgt_tsc_khz * ratio
              = tgt_tsc_khz * user_tsc_khz / tsc_khz   (see set_tsc_khz)
              = user_tsc_khz                           (see kvm_guest_time_update)
              = vcpu->arch.virtual_tsc_khz             (see kvm_set_tsc_khz)
      
      However, computing it through kvm_scale_tsc will make it possible
      to include the NTP correction in tgt_tsc_khz.
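
      A small numeric illustration of the identity above, using example
      frequencies rather than anything taken from the real code:

          #include <inttypes.h>
          #include <stdint.h>
          #include <stdio.h>

          int main(void)
          {
              /* host TSC at 3.0 GHz, userspace asked for 2.5 GHz */
              uint64_t tsc_khz = 3000000;
              uint64_t user_tsc_khz = 2500000;

              /* ratio = user_tsc_khz / tsc_khz, so scaling the host
               * frequency reproduces exactly what userspace asked for */
              uint64_t scaled = tsc_khz * user_tsc_khz / tsc_khz;

              printf("scale(%" PRIu64 " kHz) = %" PRIu64 " kHz\n",
                     tsc_khz, scaled);
              return 0;
          }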
      Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      78db6a50
    • KVM: x86: rename argument to kvm_set_tsc_khz · 4941b8cb
      Committed by Paolo Bonzini
      This refers to the desired (scaled) frequency, which is called
      user_tsc_khz in the rest of the file.
      Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4941b8cb
  13. 09 Feb 2016, 3 commits
  14. 16 Jan 2016, 1 commit
    • kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Committed by Dan Williams
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given that the memmap
      array for a large pool of persistent memory may exhaust available
      DRAM, a mechanism is introduced to allocate the memmap from
      persistent memory itself.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  15. 09 Jan 2016, 2 commits
  16. 07 Jan 2016, 2 commits
  17. 23 Dec 2015, 1 commit
  18. 22 Dec 2015, 1 commit
  19. 17 Dec 2015, 1 commit
    • kvm/x86: Hyper-V SynIC timers · 1f4b34f8
      Committed by Andrey Smetanin
      Per Hyper-V specification (and as required by Hyper-V-aware guests),
      SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
      of MSRs, and signals expiration by delivering a special format message
      to the configured SynIC message slot and triggering the corresponding
      synthetic interrupt.
      
      Note: as implemented by this patch, all periodic timers are "lazy"
      (i.e. if the vCPU wasn't scheduled for more than the timer period the
      timer events are lost), regardless of the corresponding configuration
      MSR.  If deemed necessary, the "catch up" mode (the timer period is
      shortened until the timer catches up) will be implemented later.
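
      A hedged sketch of the "lazy" periodic behaviour (an illustrative
      helper, not the KVM implementation): missed periods are simply
      skipped, and the next expiration is realigned on the timer's
      original phase using the remainder.

          #include <stdint.h>
          #include <stdio.h>

          static uint64_t next_expiration(uint64_t now, uint64_t exp_time,
                                          uint64_t period)
          {
              if (now < exp_time)
                  return exp_time;            /* not expired yet */
              /* one or more periods missed: skip them, keep the phase */
              uint64_t missed = (now - exp_time) / period;
              return exp_time + (missed + 1) * period;
          }

          int main(void)
          {
              /* armed at t=100 with period 50, vCPU idle until t=375 */
              printf("next expiration: %llu\n", (unsigned long long)
                     next_expiration(375, 100, 50));   /* prints 400 */
              return 0;
          }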
      
      Changes v2:
      * Use remainder to calculate periodic timer expiration time
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: "K. Y. Srinivasan" <kys@microsoft.com>
      CC: Haiyang Zhang <haiyangz@microsoft.com>
      CC: Vitaly Kuznetsov <vkuznets@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1f4b34f8
  20. 11 Dec 2015, 2 commits
  21. 26 Nov 2015, 3 commits
    • KVM: x86: expose MSR_TSC_AUX to userspace · 9dbe6cf9
      Committed by Paolo Bonzini
      If we do not do this, it is not properly saved and restored across
      migration.  Windows notices due to its self-protection mechanisms,
      and is very upset about it (blue screen of death).
      
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9dbe6cf9
    • kvm/x86: Hyper-V kvm exit · db397571
      Committed by Andrey Smetanin
      A new vcpu exit is introduced to notify userspace of changes in the
      Hyper-V SynIC configuration triggered by the guest writing to the
      corresponding MSRs.
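
      Roughly what the userspace side of this exit looks like, following
      the KVM_EXIT_HYPERV documentation added by this series (a sketch
      under that assumption, not tested code):

          #include <stdio.h>
          #include <linux/kvm.h>

          static void handle_exit(struct kvm_run *run)
          {
              if (run->exit_reason == KVM_EXIT_HYPERV &&
                  run->hyperv.type == KVM_EXIT_HYPERV_SYNIC) {
                  /* guest wrote a SynIC MSR: update the cached state */
                  printf("SynIC msr=%llx control=%llx msg=%llx evt=%llx\n",
                         (unsigned long long)run->hyperv.u.synic.msr,
                         (unsigned long long)run->hyperv.u.synic.control,
                         (unsigned long long)run->hyperv.u.synic.msg_page,
                         (unsigned long long)run->hyperv.u.synic.evt_page);
              }
          }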
      
      Changes v4:
      * exit into userspace only if guest writes into SynIC MSR's
      
      Changes v3:
      * added KVM_EXIT_HYPERV types and structs notes into docs
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      db397571
    • kvm/x86: Hyper-V synthetic interrupt controller · 5c919412
      Committed by Andrey Smetanin
      SynIC (synthetic interrupt controller) is a lapic extension,
      which is controlled via MSRs and maintains for each vCPU
       - 16 synthetic interrupt "lines" (SINT's); each can be configured to
         trigger a specific interrupt vector optionally with auto-EOI
         semantics
       - a message page in the guest memory with 16 256-byte per-SINT message
         slots
       - an event flag page in the guest memory with 16 2048-bit per-SINT
         event flag areas
      
      The host triggers a SINT whenever it delivers a new message to the
      corresponding slot or flips an event flag bit in the corresponding area.
      The guest informs the host that it can try delivering a message by
      explicitly asserting EOI in the lapic or by writing to the
      End-Of-Message (EOM) MSR.
      
      Userspace (qemu) triggers interrupts and receives EOM notifications
      via an irqfd with a resampler; for that, a GSI is allocated for each
      configured SINT, and the irq_routing API is extended to support
      GSI-to-SINT mapping.
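
      A rough userspace sketch of that setup (error handling elided;
      ioctl and struct names follow the uAPI documented by this series,
      so treat the exact fields as an assumption rather than tested
      code): enable the capability on the vCPU, then route a GSI to a
      (vcpu, SINT) pair so an irqfd bound to that GSI raises the
      synthetic interrupt.

          #include <stdlib.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          static int setup_synic(int vm_fd, int vcpu_fd,
                                 unsigned int gsi, unsigned int sint)
          {
              struct kvm_enable_cap cap = { .cap = KVM_CAP_HYPERV_SYNIC };

              if (ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap) < 0)
                  return -1;

              /* one routing entry: GSI -> (vcpu 0, SINT) */
              struct kvm_irq_routing *r =
                  calloc(1, sizeof(*r) + sizeof(struct kvm_irq_routing_entry));
              r->nr = 1;
              r->entries[0].gsi = gsi;
              r->entries[0].type = KVM_IRQ_ROUTING_HV_SINT;
              r->entries[0].u.hv_sint.vcpu = 0;
              r->entries[0].u.hv_sint.sint = sint;

              int ret = ioctl(vm_fd, KVM_SET_GSI_ROUTING, r);
              free(r);
              return ret;
          }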
      
      Changes v4:
      * added activation of SynIC by vcpu KVM_ENABLE_CAP
      * added per SynIC active flag
      * added deactivation of APICv upon SynIC activation
      
      Changes v3:
      * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
      docs
      
      Changes v2:
      * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
      * add Hyper-V SynIC vectors into EOI exit bitmap
      * Hyper-V SynIC SINT MSR write logic simplified
      Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c919412