1. 26 May 2018, 1 commit
  2. 25 May 2018, 1 commit
    • C
      KVM: arm/arm64: Introduce kvm_arch_vcpu_run_pid_change · bd2a6394
Authored by Christoffer Dall
      KVM/ARM differs from other architectures in having to maintain an
      additional virtual address space from that of the host and the
      guest, because we split the execution of KVM across both EL1 and
      EL2.
      
      This results in a need to explicitly map data structures into EL2
      (hyp) which are accessed from the hyp code.  As we are about to be
      more clever with our FPSIMD handling on arm64, which stores data in
      the task struct and uses thread_info flags, we will have to map
      parts of the currently executing task struct into the EL2 virtual
      address space.
      
      However, we don't want to do this on every KVM_RUN, because it is a
      fairly expensive operation to walk the page tables, and the common
      execution mode is to map a single thread to a VCPU.  By introducing
      a hook that architectures can select with
      HAVE_KVM_VCPU_RUN_PID_CHANGE, we do not introduce overhead for
      other architectures, but have a simple way to only map the data we
      need when required for arm64.
      
      This patch introduces the framework only, and wires it up in the
      arm/arm64 KVM common code.
      
      No functional change.
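
      As a rough sketch of the framework this describes (the Kconfig symbol and
      hook name are taken from the text above; the call-site details are an
      assumption, not a verbatim copy of the patch):

          /* Declared in common code; architectures that select
           * HAVE_KVM_VCPU_RUN_PID_CHANGE provide a real implementation,
           * everyone else gets a zero-cost stub. */
          #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
          int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
          #else
          static inline int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
          {
                  return 0;
          }
          #endif

          /* In the KVM_RUN ioctl path, invoked only when the calling thread
           * differs from the one that last ran this vcpu: */
          if (unlikely(oldpid != task_pid(current))) {
                  r = kvm_arch_vcpu_run_pid_change(vcpu);
                  if (r)
                          break;
                  /* ... existing pid bookkeeping ... */
          }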
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
      Signed-off-by: Dave Martin <Dave.Martin@arm.com>
      Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      bd2a6394
3. 07 March 2018, 1 commit
  4. 24 February 2018, 1 commit
    • W
      KVM: mmu: Fix overlap between public and private memslots · b28676bb
Authored by Wanpeng Li
      Reported by syzkaller:
      
          pte_list_remove: ffff9714eb1f8078 0->BUG
          ------------[ cut here ]------------
          kernel BUG at arch/x86/kvm/mmu.c:1157!
          invalid opcode: 0000 [#1] SMP
          RIP: 0010:pte_list_remove+0x11b/0x120 [kvm]
          Call Trace:
           drop_spte+0x83/0xb0 [kvm]
           mmu_page_zap_pte+0xcc/0xe0 [kvm]
           kvm_mmu_prepare_zap_page+0x81/0x4a0 [kvm]
           kvm_mmu_invalidate_zap_all_pages+0x159/0x220 [kvm]
           kvm_arch_flush_shadow_all+0xe/0x10 [kvm]
           kvm_mmu_notifier_release+0x6c/0xa0 [kvm]
           ? kvm_mmu_notifier_release+0x5/0xa0 [kvm]
           __mmu_notifier_release+0x79/0x110
           ? __mmu_notifier_release+0x5/0x110
           exit_mmap+0x15a/0x170
           ? do_exit+0x281/0xcb0
           mmput+0x66/0x160
           do_exit+0x2c9/0xcb0
           ? __context_tracking_exit.part.5+0x4a/0x150
           do_group_exit+0x50/0xd0
           SyS_exit_group+0x14/0x20
           do_syscall_64+0x73/0x1f0
           entry_SYSCALL64_slow_path+0x25/0x25
      
The reason is that when a new memslot is created, there is no guarantee
      that it does not overlap with the private memslots. This can be triggered
      by the following program:
      
         #include <fcntl.h>
         #include <pthread.h>
         #include <setjmp.h>
         #include <signal.h>
         #include <stddef.h>
         #include <stdint.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <sys/stat.h>
         #include <sys/syscall.h>
         #include <sys/types.h>
         #include <unistd.h>
         #include <linux/kvm.h>
      
         long r[16];
      
         int main()
         {
      	void *p = valloc(0x4000);
      
      	r[2] = open("/dev/kvm", 0);
      	r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
      
      	uint64_t addr = 0xf000;
      	ioctl(r[3], KVM_SET_IDENTITY_MAP_ADDR, &addr);
      	r[6] = ioctl(r[3], KVM_CREATE_VCPU, 0x0ul);
      	ioctl(r[3], KVM_SET_TSS_ADDR, 0x0ul);
      	ioctl(r[6], KVM_RUN, 0);
      	ioctl(r[6], KVM_RUN, 0);
      
      	struct kvm_userspace_memory_region mr = {
      		.slot = 0,
      		.flags = KVM_MEM_LOG_DIRTY_PAGES,
      		.guest_phys_addr = 0xf000,
      		.memory_size = 0x4000,
      		.userspace_addr = (uintptr_t) p
      	};
      	ioctl(r[3], KVM_SET_USER_MEMORY_REGION, &mr);
      	return 0;
         }
      
This patch fixes the bug by not adding the new memslot if it overlaps
      with private memslots.
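
      A minimal sketch of the resulting check in __kvm_set_memory_region(),
      reconstructed from this description and the one-line diffstat below (not
      a verbatim diff; base_gfn and npages describe the requested new slot):

          /* Check the new slot for overlaps against *all* existing slots;
           * private slots (id >= KVM_USER_MEM_SLOTS) are no longer skipped. */
          r = -EEXIST;
          kvm_for_each_memslot(slot, __kvm_memslots(kvm, as_id)) {
                  if (slot->id == id)     /* the slot being replaced */
                          continue;
                  if (!((base_gfn + npages <= slot->base_gfn) ||
                        (base_gfn >= slot->base_gfn + slot->npages)))
                          goto out;       /* refuse the overlapping memslot */
          }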
Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers3@gmail.com>
      Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      ---
       virt/kvm/kvm_main.c | 3 +--
       1 file changed, 1 insertion(+), 2 deletions(-)
      b28676bb
5. 01 February 2018, 3 commits
    • D
      mm, mmu_notifier: annotate mmu notifiers with blockable invalidate callbacks · 5ff7091f
Authored by David Rientjes
Commit 4d4bbd85 ("mm, oom_reaper: skip mm structs with mmu
      notifiers") prevented the oom reaper from unmapping private anonymous
      memory when the oom victim's mm had mmu notifiers registered.
      
The rationale is that the mmu_notifier_invalidate_range_{start,end}()
      calls that are needed around unmap_page_range() can block, and the oom
      killer would then stall forever waiting for the victim to exit, which
      may not be possible without reaping.
      
      That concern is real, but only true for mmu notifiers that have
      blockable invalidate_range_{start,end}() callbacks.  This patch adds a
      "flags" field to mmu notifier ops that can set a bit to indicate that
      these callbacks do not block.
      
      The implementation is steered toward an expensive slowpath, such as
      after the oom reaper has grabbed mm->mmap_sem of a still alive oom
      victim.
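
      Roughly, the interface looks like this (names as recalled from this
      series; treat them as illustrative rather than authoritative):

          /* Set in mmu_notifier_ops->flags by notifiers whose
           * invalidate_range_{start,end}() callbacks never block. */
          #define MMU_INVALIDATE_DOES_NOT_BLOCK   (0x01)

          struct mmu_notifier_ops {
                  int flags;
                  /* ... the existing callbacks ... */
          };

          /* Queried on the oom-reaper slowpath: reaping is only safe when every
           * registered notifier advertises non-blocking invalidation. */
          bool mm_has_blockable_invalidate_notifiers(struct mm_struct *mm);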
      
      [rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
      [akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Acked-by: Dimitri Sivanich <sivanich@hpe.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Sean Hefty <sean.hefty@intel.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ff7091f
    • M
      kvm: embed vcpu id to dentry of vcpu anon inode · e46b4692
Authored by Masatake YAMATO
All d-entries for vcpus have the same name, "anon_inode:kvm-vcpu". That
      makes it impossible to tell from userland which vcpu fd corresponds to
      which vcpu.
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu
      
It is likewise impossible to tell from userland which vma for a kvm_run
      structure belongs to which vcpu.
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu
      
This change adds the vcpu id to the d-entries for vcpus. With this change
      you get the following output:
      
          # LC_ALL=C ls -l /proc/617/fd | grep vcpu
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 18 -> anon_inode:kvm-vcpu:0
          lrwx------. 1 qemu qemu 64 Jan  7 16:50 19 -> anon_inode:kvm-vcpu:1
      
          # LC_ALL=C grep vcpu /proc/617/maps
          7f9d842d0000-7f9d842d3000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:0
          7f9d842d3000-7f9d842d6000 rw-s 00000000 00:0d 20393                      anon_inode:kvm-vcpu:1
      
With these mappings known, a tool like strace can report more detail about
      the activities of a qemu-kvm process. Here is the strace output of my local prototype:
      
          # ./strace -KK -f -p 617 2>&1 | grep 'KVM_RUN\| K'
          ...
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=0, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          [pid   664] ioctl(18, KVM_RUN, 0)       = 0 (KVM_EXIT_MMIO)
           K ready_for_interrupt_injection=1, if_flag=1, flags=0, cr8=0000000000000000, apic_base=0x000000fee00d00
           K phys_addr=0, len=1634035803, [33, 0, 0, 0, 0, 0, 0, 0], is_write=112
          ...
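
      The change itself boils down to naming the anon inode per vcpu when the
      fd is created, roughly like this (buffer size chosen for illustration):

          static int create_vcpu_fd(struct kvm_vcpu *vcpu)
          {
                  char name[32];  /* illustrative size */

                  snprintf(name, sizeof(name), "kvm-vcpu:%d", vcpu->vcpu_id);
                  return anon_inode_getfd(name, &kvm_vcpu_fops, vcpu,
                                          O_RDWR | O_CLOEXEC);
          }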
Signed-off-by: Masatake YAMATO <yamato@redhat.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      e46b4692
    • K
      kvm: Map PFN-type memory regions as writable (if possible) · a340b3e2
Authored by KarimAllah Ahmed
For EPT violations triggered by a read, the page is also mapped with
      write permissions (if its memory region is writable), which avoids taking
      yet another fault on the same page when a write occurs.
      
      Currently this optimization only happens when a "struct page" backs the
      memory region, so enable it as well for memory regions that do not have a "struct page".
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      a340b3e2
6. 16 January 2018, 1 commit
  7. 14 December 2017, 16 commits
  8. 06 December 2017, 1 commit
  9. 05 December 2017, 1 commit
  10. 28 November 2017, 1 commit
    • J
      KVM: Let KVM_SET_SIGNAL_MASK work as advertised · 20b7035c
Authored by Jan H. Schönherr
      KVM API says for the signal mask you set via KVM_SET_SIGNAL_MASK, that
      "any unblocked signal received [...] will cause KVM_RUN to return with
      -EINTR" and that "the signal will only be delivered if not blocked by
      the original signal mask".
      
This, however, is only true when the calling task has a signal handler
      registered for the signal. If not, signal evaluation is short-circuited for
      SIG_IGN and SIG_DFL, and the signal is either ignored without KVM_RUN
      returning, or the whole process is terminated.
      
      Make KVM_SET_SIGNAL_MASK behave as advertised by utilizing logic similar
      to that in do_sigtimedwait() to avoid short-circuiting of signals.
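
      A sketch of the helpers this introduces around the vcpu run loop
      (reconstructed from the description; the use of ->real_blocked mirrors
      do_sigtimedwait()):

          void kvm_sigset_activate(struct kvm_vcpu *vcpu)
          {
                  if (!vcpu->sigset_active)
                          return;
                  /* Save the caller's mask in ->real_blocked so that signals with
                   * SIG_IGN/SIG_DFL dispositions are not short-circuited away
                   * while KVM_RUN is in progress. */
                  sigprocmask(SIG_SETMASK, &vcpu->sigset, &current->real_blocked);
          }

          void kvm_sigset_deactivate(struct kvm_vcpu *vcpu)
          {
                  if (!vcpu->sigset_active)
                          return;
                  sigprocmask(SIG_SETMASK, &current->real_blocked, NULL);
                  sigemptyset(&current->real_blocked);
          }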
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20b7035c
11. 09 November 2017, 1 commit
  12. 25 October 2017, 1 commit
    • M
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
Authored by Mark Rutland
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
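
      The effect on a call site looks like this (variable names here are
      hypothetical, purely to illustrate the rewrite):

          /* before */
          val = ACCESS_ONCE(shared->counter);
          ACCESS_ONCE(shared->flag) = 1;

          /* after */
          val = READ_ONCE(shared->counter);
          WRITE_ONCE(shared->flag, 1);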
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6aa7de05
13. 12 October 2017, 1 commit
    • S
      kvm, mm: account kvm related kmem slabs to kmemcg · 46bea48a
Authored by Shakeel Butt
The kvm slabs can consume a significant amount of system memory, and
      indeed in our production environment we have observed that a lot of
      machines spend a significant amount of memory on them that cannot
      simply be written off as system overhead. Also, allocations from these
      slabs can be triggered directly by userspace applications which have
      access to kvm, so a buggy application can leak such memory. These
      caches should therefore be accounted to kmemcg.
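
      The usual mechanism for this is the SLAB_ACCOUNT flag at cache-creation
      time; as an illustrative sketch (the exact caches touched by this patch
      are not reproduced here):

          /* Allocations from this cache are now charged to the kmem cgroup of
           * the allocating task. */
          kvm_vcpu_cache = kmem_cache_create("kvm_vcpu", vcpu_size, vcpu_align,
                                             SLAB_ACCOUNT, NULL);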
Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      46bea48a
14. 20 September 2017, 1 commit
  15. 15 September 2017, 1 commit
  16. 13 September 2017, 1 commit
    • C
      KVM: fix rcu warning on VM_CREATE errors · 021086e3
Authored by Christian Borntraeger
      commit 3898da94 ("KVM: avoid using rcu_dereference_protected") can
      trigger the following lockdep/rcu splat if the VM_CREATE ioctl fails,
      for example if kvm_arch_init_vm fails:
      
      WARNING: suspicious RCU usage
      4.13.0+ #105 Not tainted
      -----------------------------
      ./include/linux/kvm_host.h:481 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      no locks held by qemu-system-s39/79.
      stack backtrace:
      CPU: 0 PID: 79 Comm: qemu-system-s39 Not tainted 4.13.0+ #105
      Hardware name: IBM 2964 NC9 704 (KVM/Linux)
      Call Trace:
      ([<00000000001140b2>] show_stack+0xea/0xf0)
       [<00000000008a68a4>] dump_stack+0x94/0xd8
       [<0000000000134c12>] kvm_dev_ioctl+0x372/0x7a0
       [<000000000038f940>] do_vfs_ioctl+0xa8/0x6c8
       [<0000000000390004>] SyS_ioctl+0xa4/0xb8
       [<00000000008c7a8c>] system_call+0xc4/0x27c
      no locks held by qemu-system-s39/79.
      
We have to reset the just-created users_count back to 0 so that the
      check does not trigger.
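
      In other words, something along these lines in kvm_create_vm()'s error
      path (a sketch based on the description above, not the verbatim patch):

          out_err:
                  /* The VM never became visible; drop users_count back to 0 so
                   * the rcu_dereference check in the teardown path stays quiet. */
                  refcount_set(&kvm->users_count, 0);
                  /* ... existing cleanup ... */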
Reported-by: Stefan Haberland <sth@linux.vnet.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Fixes: 3898da94 ("KVM: avoid using rcu_dereference_protected")
      Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      021086e3
17. 01 September 2017, 1 commit
  18. 16 August 2017, 1 commit
    • A
      kvm: avoid uninitialized-variable warnings · 076b925d
Authored by Arnd Bergmann
      When PAGE_OFFSET is not a compile-time constant, we run into
      warnings from the use of kvm_is_error_hva() that the compiler
      cannot optimize out:
      
      arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_gfn_to_hva_cache_init':
      arch/arm/kvm/../../../virt/kvm/kvm_main.c:1978:14: error: 'nr_pages_avail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function 'gfn_to_page_many_atomic':
      arch/arm/kvm/../../../virt/kvm/kvm_main.c:1660:5: error: 'entry' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This adds fake initializations to the two instances I ran into.
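
      For illustration, the two initializations look roughly like this (the
      types are assumptions based on the warnings quoted above, not the exact
      diff):

          gfn_t entry = 0;          /* in gfn_to_page_many_atomic() */
          int nr_pages_avail = 0;   /* in __kvm_gfn_to_hva_cache_init() */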
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      076b925d
19. 08 August 2017, 1 commit
    • L
      KVM: add spinlock optimization framework · 199b5763
Authored by Longpeng(Mike)
If a vcpu exits because it is spinning on a user-mode spinlock, then the
      spinlock holder may have been preempted in either user mode or kernel
      mode. (Note that not all architectures trap spin loops in user mode;
      only AMD x86 and ARM/ARM64 currently do.)
      
      But if a vcpu exits in kernel mode, then the holder must be
      preempted in kernel mode, so we should choose a vcpu in kernel mode
      as a more likely candidate for the lock holder.
      
      This introduces kvm_arch_vcpu_in_kernel() to decide whether the
      vcpu is in kernel-mode when it's preempted.  kvm_vcpu_on_spin's
      new argument says the same of the spinning VCPU.
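
      The resulting interface shape (signatures reconstructed from this
      description):

          /* Per-architecture: was the vcpu executing guest kernel code when it
           * was preempted? */
          bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);

          /* The spinning vcpu reports whether it was in kernel mode, so
           * directed yield can prefer candidates also in kernel mode. */
          void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu, bool yield_to_kernel_mode);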
Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      199b5763
20. 03 August 2017, 1 commit
  21. 27 July 2017, 1 commit
  22. 13 July 2017, 1 commit
    • C
      KVM: trigger uevents when creating or destroying a VM · 286de8f6
Authored by Claudio Imbrenda
      This patch adds a few lines to the KVM common code to fire a
      KOBJ_CHANGE uevent whenever a KVM VM is created or destroyed. The event
      carries five environment variables:
      
      CREATED indicates how many times a new VM has been created. It is
      	useful for example to trigger specific actions when the first
      	VM is started
      COUNT indicates how many VMs are currently active. This can be used for
      	logging or monitoring purposes
      PID has the pid of the KVM process that has been started or stopped.
      	This can be used to perform process-specific tuning.
      STATS_PATH contains the path in debugfs to the directory with all the
      	runtime statistics for this VM. This is useful for performance
      	monitoring and profiling.
EVENT describes the type of event; its value can be either "create" or
      	"destroy"
      
Specific udev rules can then be set up in userspace to deal with the
      creation or destruction of VMs as needed.
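
      On the kernel side, such an event can be assembled with
      kobject_uevent_env(); a minimal sketch (the variable and counter names
      here are illustrative, not the ones used in the patch):

          char created[32], count[32], pid[32], stats[128], event[16];
          char *envp[] = { created, count, pid, stats, event, NULL };

          snprintf(created, sizeof(created), "CREATED=%llu", vms_created_so_far);
          snprintf(count, sizeof(count), "COUNT=%llu", vms_currently_active);
          snprintf(pid, sizeof(pid), "PID=%d", vm_owner_pid);
          snprintf(stats, sizeof(stats), "STATS_PATH=%s", debugfs_stats_path);
          snprintf(event, sizeof(event), "EVENT=%s", creating ? "create" : "destroy");

          kobject_uevent_env(&kvm_dev.this_device->kobj, KOBJ_CHANGE, envp);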
Signed-off-by: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      286de8f6
  23. 10 7月, 2017 1 次提交