1. 03 3月, 2016 3 次提交
    • X
      KVM: page track: add notifier support · 0eb05bf2
      Xiao Guangrong 提交于
      Notifier list is introduced so that any node wants to receive the track
      event can register to the list
      
      Two APIs are introduced here:
      - kvm_page_track_register_notifier(): register the notifier to receive
        track event
      
      - kvm_page_track_unregister_notifier(): stop receiving track event by
        unregister the notifier
      
      The callback, node->track_write() is called when a write access on the
      write tracked page happens
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0eb05bf2
    • X
      KVM: page track: add the framework of guest page tracking · 21ebbeda
      Xiao Guangrong 提交于
      The array, gfn_track[mode][gfn], is introduced in memory slot for every
      guest page, this is the tracking count for the gust page on different
      modes. If the page is tracked then the count is increased, the page is
      not tracked after the count reaches zero
      
      We use 'unsigned short' as the tracking count which should be enough as
      shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
      for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
      smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
      is enough room for other trackers
      
      Two callbacks, kvm_page_track_create_memslot() and
      kvm_page_track_free_memslot() are implemented in this patch, they are
      internally used to initialize and reclaim the memory of the array
      
      Currently, only write track mode is supported
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      21ebbeda
    • X
      KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed · 92f94f1e
      Xiao Guangrong 提交于
      kvm_lpage_info->write_count is used to detect if the large page mapping
      for the gfn on the specified level is allowed, rename it to disallow_lpage
      to reflect its purpose, also we rename has_wrprotected_page() to
      mmu_gfn_lpage_is_disallowed() to make the code more clearer
      
      Later we will extend this mechanism for page tracking: if the gfn is
      tracked then large mapping for that gfn on any level is not allowed.
      The new name is more straightforward
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      92f94f1e
  2. 17 2月, 2016 4 次提交
    • P
      KVM: x86: pass kvm_get_time_scale arguments in hertz · 3ae13faa
      Paolo Bonzini 提交于
      Prepare for improving the precision in the next patch.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3ae13faa
    • P
      KVM: x86: fix missed hardware breakpoints · 4e422bdd
      Paolo Bonzini 提交于
      Sometimes when setting a breakpoint a process doesn't stop on it.
      This is because the debug registers are not loaded correctly on
      VCPU load.
      
      The following simple reproducer from Oleg Nesterov tries using debug
      registers in both the host and the guest, for example by running "./bp
      0 1" on the host and "./bp 14 15" under QEMU.
      
          #include <unistd.h>
          #include <signal.h>
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/wait.h>
          #include <sys/ptrace.h>
          #include <sys/user.h>
          #include <asm/debugreg.h>
          #include <assert.h>
      
          #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
      
          unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
          {
              unsigned long dr7;
      
              dr7 = ((len | type) & 0xf)
                  << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
              if (enable)
                  dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));
      
              return dr7;
          }
      
          int write_dr(int pid, int dr, unsigned long val)
          {
              return ptrace(PTRACE_POKEUSER, pid,
                      offsetof (struct user, u_debugreg[dr]),
                      val);
          }
      
          void set_bp(pid_t pid, void *addr)
          {
              unsigned long dr7;
              assert(write_dr(pid, 0, (long)addr) == 0);
              dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
              assert(write_dr(pid, 7, dr7) == 0);
          }
      
          void *get_rip(int pid)
          {
              return (void*)ptrace(PTRACE_PEEKUSER, pid,
                      offsetof(struct user, regs.rip), 0);
          }
      
          void test(int nr)
          {
              void *bp_addr = &&label + nr, *bp_hit;
              int pid;
      
              printf("test bp %d\n", nr);
              assert(nr < 16); // see 16 asm nops below
      
              pid = fork();
              if (!pid) {
                  assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
                  kill(getpid(), SIGSTOP);
                  for (;;) {
                      label: asm (
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                          "nop; nop; nop; nop;"
                      );
                  }
              }
      
              assert(pid == wait(NULL));
              set_bp(pid, bp_addr);
      
              for (;;) {
                  assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
                  assert(pid == wait(NULL));
      
                  bp_hit = get_rip(pid);
                  if (bp_hit != bp_addr)
                      fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
                          bp_hit - &&label, nr);
              }
          }
      
          int main(int argc, const char *argv[])
          {
              while (--argc) {
                  int nr = atoi(*++argv);
                  if (!fork())
                      test(nr);
              }
      
              while (wait(NULL) > 0)
                  ;
              return 0;
          }
      
      Cc: stable@vger.kernel.org
      Suggested-by: NNadadv Amit <namit@cs.technion.ac.il>
      Reported-by: NAndrey Wagin <avagin@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4e422bdd
    • P
      KVM: x86: rewrite handling of scaled TSC for kvmclock · 78db6a50
      Paolo Bonzini 提交于
      This is the same as before:
      
          kvm_scale_tsc(tgt_tsc_khz)
              = tgt_tsc_khz * ratio
              = tgt_tsc_khz * user_tsc_khz / tsc_khz   (see set_tsc_khz)
              = user_tsc_khz                           (see kvm_guest_time_update)
              = vcpu->arch.virtual_tsc_khz             (see kvm_set_tsc_khz)
      
      However, computing it through kvm_scale_tsc will make it possible
      to include the NTP correction in tgt_tsc_khz.
      Reviewed-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      78db6a50
    • P
      KVM: x86: rename argument to kvm_set_tsc_khz · 4941b8cb
      Paolo Bonzini 提交于
      This refers to the desired (scaled) frequency, which is called
      user_tsc_khz in the rest of the file.
      Reviewed-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4941b8cb
  3. 09 2月, 2016 3 次提交
  4. 16 1月, 2016 1 次提交
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  5. 09 1月, 2016 2 次提交
  6. 07 1月, 2016 2 次提交
  7. 23 12月, 2015 1 次提交
  8. 22 12月, 2015 1 次提交
  9. 17 12月, 2015 1 次提交
    • A
      kvm/x86: Hyper-V SynIC timers · 1f4b34f8
      Andrey Smetanin 提交于
      Per Hyper-V specification (and as required by Hyper-V-aware guests),
      SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
      of MSRs, and signals expiration by delivering a special format message
      to the configured SynIC message slot and triggering the corresponding
      synthetic interrupt.
      
      Note: as implemented by this patch, all periodic timers are "lazy"
      (i.e. if the vCPU wasn't scheduled for more than the timer period the
      timer events are lost), regardless of the corresponding configuration
      MSR.  If deemed necessary, the "catch up" mode (the timer period is
      shortened until the timer catches up) will be implemented later.
      
      Changes v2:
      * Use remainder to calculate periodic timer expiration time
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: "K. Y. Srinivasan" <kys@microsoft.com>
      CC: Haiyang Zhang <haiyangz@microsoft.com>
      CC: Vitaly Kuznetsov <vkuznets@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1f4b34f8
  10. 11 12月, 2015 2 次提交
  11. 26 11月, 2015 5 次提交
    • P
      KVM: x86: expose MSR_TSC_AUX to userspace · 9dbe6cf9
      Paolo Bonzini 提交于
      If we do not do this, it is not properly saved and restored across
      migration.  Windows notices due to its self-protection mechanisms,
      and is very upset about it (blue screen of death).
      
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9dbe6cf9
    • A
      kvm/x86: Hyper-V kvm exit · db397571
      Andrey Smetanin 提交于
      A new vcpu exit is introduced to notify the userspace of the
      changes in Hyper-V SynIC configuration triggered by guest writing to the
      corresponding MSRs.
      
      Changes v4:
      * exit into userspace only if guest writes into SynIC MSR's
      
      Changes v3:
      * added KVM_EXIT_HYPERV types and structs notes into docs
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      db397571
    • A
      kvm/x86: Hyper-V synthetic interrupt controller · 5c919412
      Andrey Smetanin 提交于
      SynIC (synthetic interrupt controller) is a lapic extension,
      which is controlled via MSRs and maintains for each vCPU
       - 16 synthetic interrupt "lines" (SINT's); each can be configured to
         trigger a specific interrupt vector optionally with auto-EOI
         semantics
       - a message page in the guest memory with 16 256-byte per-SINT message
         slots
       - an event flag page in the guest memory with 16 2048-bit per-SINT
         event flag areas
      
      The host triggers a SINT whenever it delivers a new message to the
      corresponding slot or flips an event flag bit in the corresponding area.
      The guest informs the host that it can try delivering a message by
      explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
      MSR.
      
      The userspace (qemu) triggers interrupts and receives EOM notifications
      via irqfd with resampler; for that, a GSI is allocated for each
      configured SINT, and irq_routing api is extended to support GSI-SINT
      mapping.
      
      Changes v4:
      * added activation of SynIC by vcpu KVM_ENABLE_CAP
      * added per SynIC active flag
      * added deactivation of APICv upon SynIC activation
      
      Changes v3:
      * added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
      docs
      
      Changes v2:
      * do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
      * add Hyper-V SynIC vectors into EOI exit bitmap
      * Hyper-V SyniIC SINT msr write logic simplified
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5c919412
    • A
      kvm/x86: per-vcpu apicv deactivation support · d62caabb
      Andrey Smetanin 提交于
      The decision on whether to use hardware APIC virtualization used to be
      taken globally, based on the availability of the feature in the CPU
      and the value of a module parameter.
      
      However, under certain circumstances we want to control it on per-vcpu
      basis.  In particular, when the userspace activates HyperV synthetic
      interrupt controller (SynIC), APICv has to be disabled as it's
      incompatible with SynIC auto-EOI behavior.
      
      To achieve that, introduce 'apicv_active' flag on struct
      kvm_vcpu_arch, and kvm_vcpu_deactivate_apicv() function to turn APICv
      off.  The flag is initialized based on the module parameter and CPU
      capability, and consulted whenever an APICv-specific action is
      performed.
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d62caabb
    • A
      kvm/x86: split ioapic-handled and EOI exit bitmaps · 6308630b
      Andrey Smetanin 提交于
      The function to determine if the vector is handled by ioapic used to
      rely on the fact that only ioapic-handled vectors were set up to
      cause vmexits when virtual apic was in use.
      
      We're going to break this assumption when introducing Hyper-V
      synthetic interrupts: they may need to cause vmexits too.
      
      To achieve that, introduce a new bitmap dedicated specifically for
      ioapic-handled vectors, and populate EOI exit bitmap from it for now.
      Signed-off-by: NAndrey Smetanin <asmetanin@virtuozzo.com>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      CC: Gleb Natapov <gleb@kernel.org>
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: Roman Kagan <rkagan@virtuozzo.com>
      CC: Denis V. Lunev <den@openvz.org>
      CC: qemu-devel@nongnu.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6308630b
  12. 18 11月, 2015 4 次提交
  13. 10 11月, 2015 10 次提交
  14. 04 11月, 2015 1 次提交
    • L
      KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0() · 879ae188
      Laszlo Ersek 提交于
      Commit b18d5431 ("KVM: x86: fix CR0.CD virtualization") was
      technically correct, but it broke OVMF guests by slowing down various
      parts of the firmware.
      
      Commit fb279950 ("KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED") quirked the
      first function modified by b18d5431, vmx_get_mt_mask(), for OVMF's
      sake. This restored the speed of the OVMF code that runs before
      PlatformPei (including the memory intensive LZMA decompression in SEC).
      
      This patch extends the quirk to the second function modified by
      b18d5431, kvm_set_cr0(). It eliminates the intrusive slowdown that
      hits the EFI_MP_SERVICES_PROTOCOL implementation of edk2's
      UefiCpuPkg/CpuDxe -- which is built into OVMF --, when CpuDxe starts up
      all APs at once for initialization, in order to count them.
      
      We also carry over the kvm_arch_has_noncoherent_dma() sub-condition from
      the other half of the original commit b18d5431.
      
      Fixes: b18d5431
      Cc: stable@vger.kernel.org
      Cc: Jordan Justen <jordan.l.justen@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Tested-by: NJanusz Mocek <januszmk6@gmail.com>
      Signed-off-by: Laszlo Ersek <lersek@redhat.com>#
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      879ae188