  1. 25 Dec 2016 (1 commit)
  2. 15 Dec 2016 (1 commit)
    • mm: unexport __get_user_pages_unlocked() · 8b7457ef
      Authored by Lorenzo Stoakes
      Unexport the low-level __get_user_pages_unlocked() function and replace
      its invocations with calls to more appropriate higher-level functions.
      
      In hva_to_pfn_slow() we are able to replace __get_user_pages_unlocked()
      with get_user_pages_unlocked() since we can now pass gup_flags.
      
      In async_pf_execute() and process_vm_rw_single_vec() we need to pass
      different tsk, mm arguments so get_user_pages_remote() is the sane
      replacement in these cases (having added manual acquisition and release
      of mmap_sem).
      
      Additionally get_user_pages_remote() reintroduces use of the FOLL_TOUCH
      flag.  However, this flag was originally silently dropped by commit
      1e987790 ("mm/gup: Introduce get_user_pages_remote()"), so this
      appears to have been unintentional and reintroducing it is therefore not
      an issue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20161027095141.2569-3-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
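      A minimal sketch of the hva_to_pfn_slow() replacement described above,
      assuming the post-refactoring get_user_pages_unlocked(start, nr_pages,
      pages, gup_flags) signature; the flag choice and the _sketch name are
      illustrative, not the exact code KVM ended up with:

        /* one page, gup_flags passed directly, so the __-prefixed
         * low-level helper is no longer needed */
        static long hva_to_pfn_slow_sketch(unsigned long addr, bool write_fault,
                                           struct page **page)
        {
                unsigned int flags = FOLL_HWPOISON;   /* illustrative flag choice */

                if (write_fault)
                        flags |= FOLL_WRITE;

                return get_user_pages_unlocked(addr, 1, page, flags);
        }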
  3. 01 Dec 2016 (1 commit)
  4. 28 Nov 2016 (1 commit)
    • KVM: Export kvm module parameter variables · ec76d819
      Authored by Suraj Jitindar Singh
      The kvm module has the parameters halt_poll_ns, halt_poll_ns_grow, and
      halt_poll_ns_shrink. Halt polling was recently added to the powerpc kvm-hv
      module and these parameters were essentially duplicated for that. There is
      no benefit to this duplication and it can lead to confusion when trying to
      tune halt polling.
      
      Thus move the definition of these variables to kvm_host.h and export them.
      This will allow the kvm-hv module to use the same module parameters by
      accessing these variables, which will be implemented in the next patch,
      meaning that they will no longer be duplicated.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
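      A rough sketch of the pattern being described: the variables stay defined
      in kvm_main.c, are declared in kvm_host.h, and are exported so the kvm-hv
      module can reference the same storage. The 0644 permissions and the exact
      types are assumptions for illustration:

        /* virt/kvm/kvm_main.c (sketch) */
        unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT;
        module_param(halt_poll_ns, uint, 0644);
        EXPORT_SYMBOL_GPL(halt_poll_ns);

        /* include/linux/kvm_host.h (sketch) */
        extern unsigned int halt_poll_ns;
        extern unsigned int halt_poll_ns_grow;
        extern unsigned int halt_poll_ns_shrink;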
  5. 22 Nov 2016 (1 commit)
  6. 03 Nov 2016 (1 commit)
  7. 26 Oct 2016 (1 commit)
    • KVM: fix OOPS on flush_work · 36343f6e
      Authored by Paolo Bonzini
      The conversion done by commit 3706feac ("KVM: Remove deprecated
      create_singlethread_workqueue") is broken.  It flushes a single work
      item, &irqfd->shutdown, instead of all of them, and, even worse, if there
      is no irqfd on the list it causes a NULL pointer dereference.
      Revert the virt/kvm/eventfd.c part of that patch; to avoid the
      deprecated function, just allocate our own workqueue (it does not even
      have to be unbound) with alloc_workqueue().
      
      Fixes: 3706feac
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
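      A hedged sketch of the dedicated-workqueue approach the commit describes;
      the workqueue variable name follows the commit context, while the init and
      release function names here are illustrative:

        #include <linux/workqueue.h>

        static struct workqueue_struct *irqfd_cleanup_wq;

        static int irqfd_wq_init_sketch(void)
        {
                /* a normal (bound) workqueue is enough; no WQ_UNBOUND required */
                irqfd_cleanup_wq = alloc_workqueue("kvm-irqfd-cleanup", 0, 0);
                if (!irqfd_cleanup_wq)
                        return -ENOMEM;
                return 0;
        }

        static void irqfd_wq_release_sketch(void)
        {
                /* waits for *all* queued shutdown items, which flushing a
                 * single work item could not guarantee */
                flush_workqueue(irqfd_cleanup_wq);
        }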
  8. 25 Oct 2016 (1 commit)
    • mm: unexport __get_user_pages() · 0d731759
      Authored by Lorenzo Stoakes
      This patch unexports the low-level __get_user_pages() function.
      
      Recent refactoring of the get_user_pages* functions allows flags to be
      passed through get_user_pages(), which eliminates the need for access to
      this function from its one user, kvm.
      
      We can see that the two calls to get_user_pages() which replace
      __get_user_pages() in kvm_main.c are equivalent by examining their call
      stacks:
      
        get_user_page_nowait():
          get_user_pages(start, 1, flags, page, NULL)
          __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, start, 1,
      		     flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)
      
        check_user_page_hwpoison():
          get_user_pages(addr, 1, flags, NULL, NULL)
          __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
      			    false, flags | FOLL_TOUCH)
          __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
      		     NULL, NULL)
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 19 Oct 2016 (1 commit)
  10. 16 Sep 2016 (2 commits)
  11. 08 Sep 2016 (2 commits)
    • KVM: Add provisioning for ulong vm stats and u64 vcpu stats · 8a7e75d4
      Authored by Suraj Jitindar Singh
      VMs and vCPUs have statistics associated with them which can be viewed
      in debugfs. Currently the vcpu_stat_get() and vm_stat_get() functions
      assume that all of these statistics are represented as u32s; however,
      the next patch adds some u64 vcpu statistics.
      
      Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
      Since vcpu statistics are per vcpu, they will only be updated by a single
      vcpu at a time so this shouldn't present a problem on 32-bit machines
      which can't atomically increment 64-bit numbers. However vm statistics
      could potentially be updated by multiple vcpus from that vm at a time.
      To avoid the overhead of atomics make all vm statistics ulong such that
      they are 64-bit on 64-bit systems where they can be atomically incremented
      and are 32-bit on 32-bit systems which may not be able to atomically
      increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
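      A hedged sketch of what the commit describes for the per-vCPU accessor:
      each statistic becomes a u64 stored at a fixed offset inside struct
      kvm_vcpu, and the debugfs "get" callback sums it across all vCPUs of all
      VMs (vm_list locking omitted; the _sketch name is illustrative):

        static int vcpu_stat_get_sketch(void *offset, u64 *val)
        {
                struct kvm *kvm;
                struct kvm_vcpu *vcpu;
                int i;

                *val = 0;
                list_for_each_entry(kvm, &vm_list, vm_list)
                        kvm_for_each_vcpu(i, vcpu, kvm)
                                /* u64 counter, only ever bumped by its own vcpu */
                                *val += *(u64 *)((void *)vcpu + (long)offset);
                return 0;
        }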
    • KVM: Remove deprecated create_singlethread_workqueue · 3706feac
      Authored by Bhaktipriya Shridhar
      The workqueue "irqfd_cleanup_wq" queues a single work item
      &irqfd->shutdown and hence doesn't require ordering. It is a host-wide
      workqueue for issuing deferred shutdown requests aggregated from all
      vm* instances. It is not being used on a memory reclaim path.
      Hence, it has been converted to use system_wq.
      The work item has been flushed in kvm_irqfd_release().
      
      The workqueue "wqueue" queues a single work item &timer->expired
      and hence doesn't require ordering. Also, it is not being used on
      a memory reclaim path. Hence, it has been converted to use system_wq.
      
      System workqueues have been able to handle high levels of concurrency
      for a long time now, so a single-threaded workqueue is not required just
      to gain concurrency. Unlike a dedicated per-cpu workqueue
      created with create_singlethread_workqueue(), system_wq allows multiple
      work items to overlap executions even on the same CPU; however, a
      per-cpu workqueue doesn't have any CPU locality or global ordering
      guarantee unless the target CPU is explicitly specified and thus the
      increase of local concurrency shouldn't make any difference.
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
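      A hedged sketch of the system_wq usage described above (the struct member
      name follows the commit text; the function names are illustrative). Note
      that commit 36343f6e further up in this log later reverts the eventfd.c
      part, because flushing one work item is not equivalent to flushing the
      old dedicated queue:

        #include <linux/workqueue.h>

        static void irqfd_shutdown_queue_sketch(struct kvm_kernel_irqfd *irqfd)
        {
                schedule_work(&irqfd->shutdown);   /* queued on system_wq */
        }

        static void irqfd_shutdown_flush_sketch(struct kvm_kernel_irqfd *irqfd)
        {
                flush_work(&irqfd->shutdown);      /* waits for this item only */
        }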
  12. 12 Aug 2016 (2 commits)
  13. 19 Jul 2016 (1 commit)
  14. 15 Jul 2016 (5 commits)
  15. 05 Jul 2016 (2 commits)
  16. 16 Jun 2016 (3 commits)
    • kvm: Fix irq route entries exceeding KVM_MAX_IRQ_ROUTES · caf1ff26
      Authored by Xiubo Li
      Recently we experienced a guest crash with 8 cores and 3 disks, with the
      following qemu error log:
      
      qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
      kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
      
      We then found a patch (bdf026317d) in the qemu tree which is said to fix
      this bug.
      
      Running the following script reproduces the bug quickly:
      
      irq_affinity.sh
      ========================================================================
      
      vda_irq_num=25
      vdb_irq_num=27
      while [ 1 ]
      do
          for irq in {1,2,4,8,10,20,40,80}
              do
                  echo $irq > /proc/irq/$vda_irq_num/smp_affinity
                  echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
                  dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
                  dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
              done
      done
      ========================================================================
      
      The following qemu log message was added to the qemu code and is displayed
      when this bug is reproduced:
      
      kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
      irq_routes->nr: 1024, gsi_count: 1024.
      
      That is to say, when irq_routes->nr == 1024 there are 1024 routing
      entries, but the kernel code simply returns -EINVAL once routes->nr >= 1024.
      
      Here nr is the number of routing entries, in the range
      [1, KVM_MAX_IRQ_ROUTES], not an index in [0, KVM_MAX_IRQ_ROUTES - 1].
      
      This patch fixes the bug described above.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
      Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
      Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
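      The gist of the fix, as described above, is an off-by-one in the bounds
      check on the entry count. A hedged sketch (not the verbatim diff) of the
      KVM_SET_GSI_ROUTING ioctl path:

        /* routing.nr counts entries (1..KVM_MAX_IRQ_ROUTES); it is not an index */

        /* before: a full table of exactly KVM_MAX_IRQ_ROUTES entries is rejected */
        if (routing.nr >= KVM_MAX_IRQ_ROUTES)
                return -EINVAL;

        /* after: only a table that is genuinely too large is rejected */
        if (routing.nr > KVM_MAX_IRQ_ROUTES)
                return -EINVAL;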
    • KVM: remove kvm_vcpu_compatible · 557abc40
      Authored by Paolo Bonzini
      The new created_vcpus field makes it possible to avoid the race between
      irqchip and VCPU creation in a much nicer way; just check under kvm->lock
      whether a VCPU has already been created.
      
      We can then remove KVM_APIC_ARCHITECTURE too, because at this point the
      symbol is only governing the default definition of kvm_vcpu_compatible.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: introduce kvm->created_vcpus · 6c7caebc
      Authored by Paolo Bonzini
      The race between creating the irqchip and the first VCPU is
      currently fixed by checking the presence of an irqchip before
      updating kvm->online_vcpus, and undoing the whole VCPU creation
      if someone created the irqchip in the meanwhile.
      
      Instead, introduce a new field in struct kvm that will count VCPUs
      under a mutex, without the atomic access and memory ordering that we
      need elsewhere to protect the vcpus array.  This also plugs the race
      and is more easily applicable in all similar circumstances.
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
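      A hedged sketch of the idea behind this commit and the kvm_vcpu_compatible
      removal above: vCPU creation reserves a slot in created_vcpus under
      kvm->lock before any expensive setup, and irqchip creation can simply
      refuse once any vCPU exists. Function names and error handling are
      simplified for illustration:

        static int create_vcpu_sketch(struct kvm *kvm, u32 id)
        {
                mutex_lock(&kvm->lock);
                if (kvm->created_vcpus == KVM_MAX_VCPUS) {
                        mutex_unlock(&kvm->lock);
                        return -EINVAL;
                }
                kvm->created_vcpus++;           /* counted before online_vcpus */
                mutex_unlock(&kvm->lock);

                /* ... arch vcpu setup; decrement created_vcpus on failure ... */
                return 0;
        }

        static int create_irqchip_check_sketch(struct kvm *kvm)
        {
                int r = -EINVAL;

                mutex_lock(&kvm->lock);
                if (!kvm->created_vcpus)
                        r = 0;                  /* safe: no vCPU created yet */
                mutex_unlock(&kvm->lock);
                return r;
        }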
  17. 02 Jun 2016 (1 commit)
  18. 25 May 2016 (1 commit)
    • KVM: Create debugfs dir and stat files for each VM · 536a6f88
      Authored by Janosch Frank
      This patch adds a kvm debugfs subdirectory for each VM, which is named
      after its pid and file descriptor. The directories contain the same
      kind of files that are already in the kvm debugfs directory, but the
      data exported through them is now VM specific.
      
      This makes the debugfs kvm data a convenient alternative to the
      tracepoints, which already have per-VM data. The debugfs data is easy
      to read and has low overhead.
      
      CC: Dan Carpenter <dan.carpenter@oracle.com> [includes fixes by Dan Carpenter]
      Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
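      A hedged sketch of the per-VM debugfs directory described above: a
      "<pid>-<fd>" directory under the main kvm debugfs directory, later
      populated with one file per statistic. The field and function names here
      are illustrative approximations:

        static int kvm_create_vm_debugfs_sketch(struct kvm *kvm, int fd)
        {
                char dir_name[32];

                snprintf(dir_name, sizeof(dir_name), "%d-%d",
                         task_pid_nr(current), fd);
                kvm->debugfs_dentry = debugfs_create_dir(dir_name, kvm_debugfs_dir);
                if (!kvm->debugfs_dentry)
                        return -ENOMEM;

                /* ... then create one debugfs file per entry of the existing
                 *     vm/vcpu stat tables, pointing at this VM's data ... */
                return 0;
        }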
  19. 19 May 2016 (2 commits)
  20. 13 May 2016 (1 commit)
    • KVM: halt_polling: provide a way to qualify wakeups during poll · 3491caf2
      Authored by Christian Borntraeger
      Some wakeups should not be considered a successful poll. For example, on
      s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
      would be considered runnable, letting all vCPUs poll all the time for
      transaction-like workloads, even if one vCPU would be enough.
      This can result in huge CPU usage for large guests.
      This patch lets architectures provide a way to qualify wakeups as
      good or bad with regard to polling.
      
      For s390 the implementation will fence off halt polling for anything but
      known good, single-vCPU events. The s390 implementation for floating
      interrupts does a wakeup for one vCPU, but the interrupt will be delivered
      by whatever CPU checks first for a pending interrupt. We give preference
      to the woken-up CPU by marking its poll as a "good" poll.
      This code will also mark several other wakeup reasons, like IPIs or
      expired timers, as "good". This will of course also mark some events as
      not successful. Since KVM on z always runs as a second-level hypervisor,
      we prefer not to poll unless we are really sure, though.
      
      This patch successfully limits the CPU usage for cases like uperf 1byte
      transactional ping pong workload or wakeup heavy workload like OLTP
      while still providing a proper speedup.
      
      This also introduces a new vcpu stat "halt_poll_no_tuning" that marks
      wakeups that are considered not good for polling.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
      Cc: David Matlack <dmatlack@google.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      [Rename config symbol. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  21. 12 May 2016 (1 commit)
    • kvm: introduce KVM_MAX_VCPU_ID · 0b1b1dfd
      Authored by Greg Kurz
      The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
      also the upper limit for vCPU ids. This is okay for all archs except PowerPC
      which can have higher ids, depending on the cpu/core/thread topology. In the
      worst case (single threaded guest, host with 8 threads per core), it limits
      the maximum number of vCPUs to KVM_MAX_VCPUS / 8.
      
      This patch separates the vCPU numbering from the total number of vCPUs by
      introducing KVM_MAX_VCPU_ID, defined as the maximal valid vCPU id plus one.
      
      The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
      before passing them to KVM_CREATE_VCPU.
      
      This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
      Other archs continue to return KVM_MAX_VCPUS instead.
      Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
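      A hedged sketch of the mechanism described above: architectures may
      override KVM_MAX_VCPU_ID, the default preserves the old behaviour, and
      vCPU creation checks the requested id against it. The helper name is
      illustrative:

        #ifndef KVM_MAX_VCPU_ID
        #define KVM_MAX_VCPU_ID KVM_MAX_VCPUS   /* default: old behaviour */
        #endif

        static int check_vcpu_id_sketch(u32 id)
        {
                /* the limit userspace can query via KVM_CAP_MAX_VCPU_ID */
                if (id >= KVM_MAX_VCPU_ID)
                        return -EINVAL;
                return 0;
        }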
  22. 22 Mar 2016 (3 commits)
    • KVM: Replace smp_mb() with smp_load_acquire() in the kvm_flush_remote_tlbs() · 4ae3cb3a
      Authored by Lan Tianyu
      smp_load_acquire() is sufficient here and is cheaper than smp_mb().
      Also add a comment about reusing the memory barrier of
      kvm_make_all_cpus_request() to keep ordering between modifications to the
      page tables and reading mode.
      Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
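      A hedged sketch of the pattern described above (close to, but not
      necessarily identical with, the resulting kvm_flush_remote_tlbs()): the
      acquire load of tlbs_dirty pairs with the barrier inside
      kvm_make_all_cpus_request(), and the cmpxchg only clears the count that
      was observed:

        void kvm_flush_remote_tlbs_sketch(struct kvm *kvm)
        {
                /* acquire: read tlbs_dirty before sending the flush request */
                long dirty_count = smp_load_acquire(&kvm->tlbs_dirty);

                if (kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
                        ++kvm->stat.remote_tlb_flush;
                /* only clear what we saw; a concurrent increment is preserved */
                cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
        }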
    • KVM: fix spin_lock_init order on x86 · e9ad4ec8
      Authored by Paolo Bonzini
      Moving the initialization earlier is needed in 4.6 because
      kvm_arch_init_vm is now using mmu_lock, causing lockdep to
      complain:
      
      [  284.440294] INFO: trying to register non-static key.
      [  284.445259] the code is fine but needs lockdep annotation.
      [  284.450736] turning off the locking correctness validator.
      ...
      [  284.528318]  [<ffffffff810aecc3>] lock_acquire+0xd3/0x240
      [  284.533733]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.541467]  [<ffffffff81715581>] _raw_spin_lock+0x41/0x80
      [  284.546960]  [<ffffffffa0305aa0>] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.554707]  [<ffffffffa0305aa0>] kvm_page_track_register_notifier+0x20/0x60 [kvm]
      [  284.562281]  [<ffffffffa02ece70>] kvm_mmu_init_vm+0x20/0x30 [kvm]
      [  284.568381]  [<ffffffffa02dbf7a>] kvm_arch_init_vm+0x1ea/0x200 [kvm]
      [  284.574740]  [<ffffffffa02bff3f>] kvm_dev_ioctl+0xbf/0x4d0 [kvm]
      
      However, it also helps fixing a preexisting problem, which is why this
      patch is also good for stable kernels: kvm_create_vm was incrementing
      current->mm->mm_count but not decrementing it at the out_err label (in
      case kvm_init_mmu_notifier failed).  The new initialization order makes
      it possible to add the required mmdrop without adding a new error label.
      
      Cc: stable@vger.kernel.org
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
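      A hedged, heavily abbreviated sketch of the reordering described above in
      kvm_create_vm(): the locks and the mm reference are set up before
      kvm_arch_init_vm() so arch code can take mmu_lock, and the shared error
      label can now drop the mm reference:

        static struct kvm *kvm_create_vm_sketch(unsigned long type)
        {
                struct kvm *kvm = kvm_arch_alloc_vm();
                int r;

                if (!kvm)
                        return ERR_PTR(-ENOMEM);

                spin_lock_init(&kvm->mmu_lock);         /* now done early ...  */
                atomic_inc(&current->mm->mm_count);     /* ... with the mm ref */
                kvm->mm = current->mm;
                mutex_init(&kvm->lock);

                r = kvm_arch_init_vm(kvm, type);        /* may take mmu_lock   */
                if (r)
                        goto out_err;
                return kvm;

        out_err:
                mmdrop(current->mm);    /* fixes the preexisting mm_count leak */
                kvm_arch_free_vm(kvm);
                return ERR_PTR(r);
        }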
  23. 09 Mar 2016 (1 commit)
    • kvm: cap halt polling at exactly halt_poll_ns · 313f636d
      Authored by David Matlack
      When growing halt-polling, there is no check that the poll time exceeds
      the limit. It's possible for vcpu->halt_poll_ns to grow once past
      halt_poll_ns and stay there until a halt takes longer than
      vcpu->halt_poll_ns. For example, booting a Linux guest with
      halt_poll_ns=11000:
      
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
       ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)
      Signed-off-by: David Matlack <dmatlack@google.com>
      Fixes: aca6ff29
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
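      A hedged sketch of the capped grow step: after growing, the per-vCPU value
      is clamped to the module-wide halt_poll_ns limit. The start value and the
      helper name are illustrative:

        static void grow_halt_poll_ns_sketch(struct kvm_vcpu *vcpu)
        {
                unsigned int val = vcpu->halt_poll_ns;

                if (val == 0)
                        val = 10000;                    /* illustrative start */
                else
                        val *= halt_poll_ns_grow;

                if (val > halt_poll_ns)                 /* the new cap */
                        val = halt_poll_ns;

                vcpu->halt_poll_ns = val;
        }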
  24. 04 Mar 2016 (1 commit)
    • KVM: ensure __gfn_to_pfn_memslot initializes *writable · b2740d35
      Authored by Paolo Bonzini
      In the kvm_is_error_hva case, ubsan complains if the uninitialized
      writable is passed to __direct_map, even though the value itself is not
      used (__direct_map goes through mmu_set_spte->set_spte->set_mmio_spte but
      never looks at that argument).
      
      Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and
      avoids this kind of issue.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
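      A hedged sketch of the fix described above: on the early-exit path of
      __gfn_to_pfn_memslot(), *writable is given a defined value before
      returning, so callers never see an uninitialized bool. The signature
      follows the kernel of that era; the _sketch suffix marks this as
      illustrative rather than the exact code:

        kvm_pfn_t __gfn_to_pfn_memslot_sketch(struct kvm_memory_slot *slot,
                                              gfn_t gfn, bool atomic, bool *async,
                                              bool write_fault, bool *writable)
        {
                unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

                if (kvm_is_error_hva(addr)) {
                        if (writable)
                                *writable = false;      /* cheap, silences ubsan */
                        return KVM_PFN_NOSLOT;
                }

                /* ... otherwise continue into the normal hva_to_pfn() path ... */
                return hva_to_pfn(addr, atomic, async, write_fault, writable);
        }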
  25. 25 Feb 2016 (1 commit)
    • KVM: Use simple waitqueue for vcpu->wq · 8577370f
      Authored by Marcelo Tosatti
      The problem:
      
      On -rt, an emulated LAPIC timer instance has the following path:
      
      1) hard interrupt
      2) ksoftirqd is scheduled
      3) ksoftirqd wakes up vcpu thread
      4) vcpu thread is scheduled
      
      This extra context switch introduces unnecessary latency in the
      LAPIC path for a KVM guest.
      
      The solution:
      
      Allow waking up vcpu thread from hardirq context,
      thus avoiding the need for ksoftirqd to be scheduled.
      
      Normal waitqueues make use of spinlocks, which on -RT
      are sleepable locks. Therefore, waking up a waitqueue
      waiter involves locking a sleeping lock, which
      is not allowed from hard interrupt context.
      
      cyclictest command line:
      
      This patch reduces the average latency in my tests from 14us to 11us.
      
      Daniel writes:
      Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
      benchmark on mainline. The test was run 1000 times on
      tip/sched/core 4.4.0-rc8-01134-g0905f04e:
      
        ./x86-run x86/tscdeadline_latency.flat -cpu host
      
      with idle=poll.
      
      The test seems not to deliver really stable numbers, though most of
      them are smaller. Paolo writes:
      
      "Anything above ~10000 cycles means that the host went to C1 or
      lower---the number means more or less nothing in that case.
      
      The mean shows an improvement indeed."
      
      Before:
      
                     min             max         mean           std
      count  1000.000000     1000.000000  1000.000000   1000.000000
      mean   5162.596000  2019270.084000  5824.491541  20681.645558
      std      75.431231   622607.723969    89.575700   6492.272062
      min    4466.000000    23928.000000  5537.926500    585.864966
      25%    5163.000000  16132529.750000  5790.132275  16683.745433
      50%    5175.000000  2281919.000000  5834.654000  23151.990026
      75%    5190.000000  2382865.750000  5861.412950  24148.206168
      max    5228.000000  4175158.000000  6254.827300  46481.048691
      
      After
                     min            max         mean           std
      count  1000.000000     1000.00000  1000.000000   1000.000000
      mean   5143.511000  2076886.10300  5813.312474  21207.357565
      std      77.668322   610413.09583    86.541500   6331.915127
      min    4427.000000    25103.00000  5529.756600    559.187707
      25%    5148.000000  1691272.75000  5784.889825  17473.518244
      50%    5160.000000  2308328.50000  5832.025000  23464.837068
      75%    5172.000000  2393037.75000  5853.177675  24223.969976
      max    5222.000000  3922458.00000  6186.720500  42520.379830
      
      [Patch was originally based on the swait implementation found in the -rt
       tree. Daniel ported it to mainline's version and gathered the
       benchmark numbers for tscdeadline_latency test.]
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: linux-rt-users@vger.kernel.org
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
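      A hedged sketch of the simple-waitqueue conversion: vcpu->wq becomes a
      swait_queue_head and the wakeup side uses swake_up(), which is usable from
      hard-irq context on -rt because simple waitqueues are built on raw
      spinlocks. (swake_up()/swait_active() are the 4.6-era names; later kernels
      renamed swake_up() to swake_up_one().) The struct and function here are
      standalone illustrations, not the kvm code itself:

        #include <linux/swait.h>

        struct vcpu_sketch {
                struct swait_queue_head wq;     /* was: wait_queue_head_t wq */
        };

        static void vcpu_wake_up_sketch(struct vcpu_sketch *vcpu)
        {
                if (swait_active(&vcpu->wq))
                        swake_up(&vcpu->wq);    /* safe from hardirq context */
        }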
  26. 23 Feb 2016 (1 commit)
  27. 17 Feb 2016 (1 commit)