1. 19 5月, 2016 2 次提交
  2. 13 5月, 2016 1 次提交
    • C
      KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Christian Borntraeger 提交于
      Some wakeups should not be considered a sucessful poll. For example on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable - letting all vCPUs poll all the time for
      transactional like workload, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups if they
      should be considered a good/bad wakeups in regard to polls.
      
      For s390 the implementation will fence of halt polling for anything but
      known good, single vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We prefer the
      woken up CPU by marking the poll of this CPU as "good" poll.
      This code will also mark several other wakeup reasons like IPI or
      expired timers as "good". This will of course also mark some events as
      not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
      we prefer to not poll, unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3491caf2
  3. 12 5月, 2016 1 次提交
    • G
      kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Greg Kurz 提交于
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
      also the upper limit for vCPU ids. This is okay for all archs except PowerPC
      which can have higher ids, depending on the cpu/core/thread topology. In the
      worst case (single threaded guest, host with 8 threads per core), it limits
      the maximum number of vCPUS to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs, with
      the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
      plus one.
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
      Suggested-by: NRadim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0b1b1dfd
  4. 22 3月, 2016 3 次提交
    • L
      KVM: Replace smp_mb() with smp_load_acquire() in the kvm_flush_remote_tlbs() · 4ae3cb3a
      Lan Tianyu 提交于
      smp_load_acquire() is enough here and it's cheaper than smp_mb().
      Adding a comment about reusing memory barrier of kvm_make_all_cpus_request()
      here to keep order between modifications to the page tables and reading mode.
      Signed-off-by: NLan Tianyu <tianyu.lan@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4ae3cb3a
    • L
    • P
      KVM: fix spin_lock_init order on x86 · e9ad4ec8
      Paolo Bonzini 提交于
      Moving the initialization earlier is needed in 4.6 because
      kvm_arch_init_vm is now using mmu_lock, causing lockdep to
      complain:
      
      [  284.440294] INFO: trying to register non-static key.
      [  284.445259] the code is fine but needs lockdep annotation.
      [  284.450736] turning off the locking correctness validator.
      ...
      [  284.528318]  [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
      [  284.533733]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.541467]  [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
      [  284.546960]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.554707]  [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.562281]  [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
      [  284.568381]  [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
      [  284.574740]  [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
      
      However, it also helps fixing a preexisting problem, which is why this
      patch is also good for stable kernels: kvm_create_vm was incrementing
      current->mm->mm_count but not decrementing it at the out_err label (in
      case kvm_init_mmu_notifier failed).  The new initialization order makes
      it possible to add the required mmdrop without adding a new error label.
      
      Cc: stable@vger.kernel.org
      Reported-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e9ad4ec8
  5. 09 3月, 2016 1 次提交
    • D
      kvm: cap halt polling at exactly halt_poll_ns · 313f636d
      David Matlack 提交于
      When growing halt-polling, there is no check that the poll time exceeds
      the limit. It's possible for vcpu->halt_poll_ns grow once past
      halt_poll_ns, and stay there until a halt which takes longer than
      vcpu->halt_poll_ns. For example, booting a Linux guest with
      halt_poll_ns=11000:
      
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Fixes: aca6ff29
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      313f636d
  6. 04 3月, 2016 1 次提交
    • P
      KVM: ensure __gfn_to_pfn_memslot initializes *writable · b2740d35
      Paolo Bonzini 提交于
      For the kvm_is_error_hva, ubsan complains if the uninitialized writable
      is passed to __direct_map, even though the value itself is not used
      (__direct_map goes to mmu_set_spte->set_spte->set_mmio_spte but never
      looks at that argument).
      
      Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and
      avoids this kind of issue.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b2740d35
  7. 25 2月, 2016 1 次提交
    • M
      KVM: Use simple waitqueue for vcpu->wq · 8577370f
      Marcelo Tosatti 提交于
      The problem:
      
      On -rt, an emulated LAPIC timer instances has the following path:
      
      1) hard interrupt
      2) ksoftirqd is scheduled
      3) ksoftirqd wakes up vcpu thread
      4) vcpu thread is scheduled
      
      This extra context switch introduces unnecessary latency in the
      LAPIC path for a KVM guest.
      
      The solution:
      
      Allow waking up vcpu thread from hardirq context,
      thus avoiding the need for ksoftirqd to be scheduled.
      
      Normal waitqueues make use of spinlocks, which on -RT
      are sleepable locks. Therefore, waking up a waitqueue
      waiter involves locking a sleeping lock, which
      is not allowed from hard interrupt context.
      
      cyclictest command line:
      
      This patch reduces the average latency in my tests from 14us to 11us.
      
      Daniel writes:
      Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
      benchmark on mainline. The test was run 1000 times on
      tip/sched/core 4.4.0-rc8-01134-g0905f04e:
      
        ./x86-run x86/tscdeadline_latency.flat -cpu host
      
      with idle=poll.
      
      The test seems not to deliver really stable numbers though most of
      them are smaller. Paolo write:
      
      "Anything above ~10000 cycles means that the host went to C1 or
      lower---the number means more or less nothing in that case.
      
      The mean shows an improvement indeed."
      
      Before:
      
                     min             max         mean           std
      count  1000.000000     1000.000000  1000.000000   1000.000000
      mean   5162.596000  2019270.084000  5824.491541  20681.645558
      std      75.431231   622607.723969    89.575700   6492.272062
      min    4466.000000    23928.000000  5537.926500    585.864966
      25%    5163.000000  1613252.750000  5790.132275  16683.745433
      50%    5175.000000  2281919.000000  5834.654000  23151.990026
      75%    5190.000000  2382865.750000  5861.412950  24148.206168
      max    5228.000000  4175158.000000  6254.827300  46481.048691
      
      After
                     min            max         mean           std
      count  1000.000000     1000.00000  1000.000000   1000.000000
      mean   5143.511000  2076886.10300  5813.312474  21207.357565
      std      77.668322   610413.09583    86.541500   6331.915127
      min    4427.000000    25103.00000  5529.756600    559.187707
      25%    5148.000000  1691272.75000  5784.889825  17473.518244
      50%    5160.000000  2308328.50000  5832.025000  23464.837068
      75%    5172.000000  2393037.75000  5853.177675  24223.969976
      max    5222.000000  3922458.00000  6186.720500  42520.379830
      
      [Patch was originaly based on the swait implementation found in the -rt
       tree. Daniel ported it to mainline's version and gathered the
       benchmark numbers for tscdeadline_latency test.]
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      8577370f
  8. 23 2月, 2016 1 次提交
  9. 17 2月, 2016 1 次提交
  10. 16 2月, 2016 1 次提交
    • D
      mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Dave Hansen 提交于
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used.
      (implemented in previous patch)
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4edcf0d
  11. 16 1月, 2016 1 次提交
    • D
      kvm: rename pfn_t to kvm_pfn_t · ba049e93
      Dan Williams 提交于
      To date, we have implemented two I/O usage models for persistent memory,
      PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
      userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
      to be the target of direct-i/o.  It allows userspace to coordinate
      DMA/RDMA from/to persistent memory.
      
      The implementation leverages the ZONE_DEVICE mm-zone that went into
      4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
      and dynamically mapped by a device driver.  The pmem driver, after
      mapping a persistent memory range into the system memmap via
      devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
      page-backed pmem-pfns via flags in the new pfn_t type.
      
      The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
      resulting pte(s) inserted into the process page tables with a new
      _PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
      off _PAGE_DEVMAP to pin the device hosting the page range active.
      Finally, get_page() and put_page() are modified to take references
      against the device driver established page mapping.
      
      Finally, this need for "struct page" for persistent memory requires
      memory capacity to store the memmap array.  Given the memmap array for a
      large pool of persistent may exhaust available DRAM introduce a
      mechanism to allocate the memmap from persistent memory.  The new
      "struct vmem_altmap *" parameter to devm_memremap_pages() enables
      arch_add_memory() to use reserved pmem capacity rather than the page
      allocator.
      
      This patch (of 18):
      
      The core has developed a need for a "pfn_t" type [1].  Move the existing
      pfn_t in KVM to kvm_pfn_t [2].
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.htmlSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NChristoffer Dall <christoffer.dall@linaro.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba049e93
  12. 09 1月, 2016 1 次提交
  13. 30 11月, 2015 2 次提交
  14. 26 11月, 2015 1 次提交
  15. 23 10月, 2015 1 次提交
  16. 01 10月, 2015 3 次提交
  17. 25 9月, 2015 1 次提交
  18. 16 9月, 2015 1 次提交
    • P
      KVM: add halt_attempted_poll to VCPU stats · 62bea5bf
      Paolo Bonzini 提交于
      This new statistic can help diagnosing VCPUs that, for any reason,
      trigger bad behavior of halt_poll_ns autotuning.
      
      For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
      like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
      10+20+40+80+160+320+480 = 1110 microseconds out of every
      479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
      is consuming about 30% more CPU than it would use without
      polling.  This would show as an abnormally high number of
      attempted polling compared to the successful polls.
      
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      62bea5bf
  19. 15 9月, 2015 1 次提交
  20. 14 9月, 2015 1 次提交
  21. 11 9月, 2015 1 次提交
    • V
      mmu-notifier: add clear_young callback · 1d7715c6
      Vladimir Davydov 提交于
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not only
      in primary, but also in secondary ptes.  The latter is required in order
      to estimate wss of KVM VMs.  At the same time we want to avoid flushing
      tlb, because it is quite expensive and it won't really affect the final
      result.
      
      Currently, there is no function for clearing pte young bit that would meet
      our requirements, so this patch introduces one.  To achieve that we have
      to add a new mmu-notifier callback, clear_young, since there is no method
      for testing-and-clearing a secondary pte w/o flushing tlb.  The new method
      is not mandatory and currently only implemented by KVM.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d7715c6
  22. 06 9月, 2015 3 次提交
    • W
      KVM: trace kvm_halt_poll_ns grow/shrink · 2cbd7824
      Wanpeng Li 提交于
      Tracepoint for dynamic halt_pool_ns, fired on every potential change.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2cbd7824
    • W
      KVM: dynamic halt-polling · aca6ff29
      Wanpeng Li 提交于
      There is a downside of always-poll since poll is still happened for idle
      vCPUs which can waste cpu usage. This patchset add the ability to adjust
      halt_poll_ns dynamically, to grow halt_poll_ns when shot halt is detected,
      and to shrink halt_poll_ns when long halt is detected.
      
      There are two new kernel parameters for changing the halt_poll_ns:
      halt_poll_ns_grow and halt_poll_ns_shrink.
      
                              no-poll      always-poll    dynamic-poll
      -----------------------------------------------------------------------
      Idle (nohz) vCPU %c0     0.15%        0.3%            0.2%
      Idle (250HZ) vCPU %c0    1.1%         4.6%~14%        1.2%
      TCP_RR latency           34us         27us            26.7us
      
      "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
      c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
      guest was tickless. (250HZ) means the guest was ticking at 250HZ.
      
      The big win is with ticking operating systems. Running the linux guest
      with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
      to no-polling overhead levels by using the dynamic-poll. The savings
      should be even higher for higher frequency ticks.
      Suggested-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      [Simplify the patch. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      aca6ff29
    • W
      KVM: make halt_poll_ns per-vCPU · 19020f8a
      Wanpeng Li 提交于
      Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
      to allow greater flexibility.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      19020f8a
  23. 30 7月, 2015 1 次提交
  24. 29 7月, 2015 1 次提交
  25. 04 7月, 2015 1 次提交
  26. 05 6月, 2015 2 次提交
    • P
      KVM: implement multiple address spaces · f481b069
      Paolo Bonzini 提交于
      Only two ioctls have to be modified; the address space id is
      placed in the higher 16 bits of their slot id argument.
      
      As of this patch, no architecture defines more than one
      address space; x86 will be the first.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f481b069
    • P
      KVM: add vcpu-specific functions to read/write/translate GFNs · 8e73485c
      Paolo Bonzini 提交于
      We need to hide SMRAM from guests not running in SMM.  Therefore, all
      uses of kvm_read_guest* and kvm_write_guest* must be changed to use
      different address spaces, depending on whether the VCPU is in system
      management mode.  We need to introduce a new family of functions for
      this purpose.
      
      For now, the VCPU-based functions have the same behavior as the
      existing per-VM ones, they just accept a different type for the
      first argument.  Later however they will be changed to use one of many
      "struct kvm_memslots" stored in struct kvm, through an architecture hook.
      VM-based functions will unconditionally use the first memslots pointer.
      
      Whenever possible, this patch introduces slot-based functions with an
      __ prefix, with two wrappers for generic and vcpu-based actions.
      The exceptions are kvm_read_guest and kvm_write_guest, which are copied
      into the new functions kvm_vcpu_read_guest and kvm_vcpu_write_guest.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8e73485c
  27. 28 5月, 2015 4 次提交
  28. 26 5月, 2015 1 次提交