1. 16 April 2019, 24 commits
  2. 15 April 2019, 1 commit
    • KVM: x86/mmu: Fix an inverted list_empty() check when zapping sptes · cfd32acf
      Committed by Sean Christopherson
      A recently introduced helper for handling zap vs. remote flush
      incorrectly bails early, effectively leaking defunct shadow pages.
      Manifests as a slab BUG when exiting KVM due to the shadow pages
      being alive when their associated cache is destroyed.
      
      ==========================================================================
      BUG kvm_mmu_page_header: Objects remaining in kvm_mmu_page_header on ...
      --------------------------------------------------------------------------
      Disabling lock debugging due to kernel taint
      INFO: Slab 0x00000000fc436387 objects=26 used=23 fp=0x00000000d023caee ...
      CPU: 6 PID: 4315 Comm: rmmod Tainted: G    B             5.1.0-rc2+ #19
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x46/0x5b
       slab_err+0xad/0xd0
       ? on_each_cpu_mask+0x3c/0x50
       ? ksm_migrate_page+0x60/0x60
       ? on_each_cpu_cond_mask+0x7c/0xa0
       ? __kmalloc+0x1ca/0x1e0
       __kmem_cache_shutdown+0x13a/0x310
       shutdown_cache+0xf/0x130
       kmem_cache_destroy+0x1d5/0x200
       kvm_mmu_module_exit+0xa/0x30 [kvm]
       kvm_arch_exit+0x45/0x60 [kvm]
       kvm_exit+0x6f/0x80 [kvm]
       vmx_exit+0x1a/0x50 [kvm_intel]
       __x64_sys_delete_module+0x153/0x1f0
       ? exit_to_usermode_loop+0x88/0xc0
       do_syscall_64+0x4f/0x100
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
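
      Based on the description, the inverted check lived in the helper
      introduced by a2113634, kvm_mmu_remote_flush_or_zap(); below is a
      minimal sketch of the corrected logic, reconstructed from the commit
      message rather than copied from the upstream diff:

          static bool kvm_mmu_remote_flush_or_zap(struct kvm *kvm,
                                                  struct list_head *invalid_list,
                                                  bool remote_flush)
          {
                  /*
                   * The bug was an inverted check (!list_empty), which made the
                   * helper bail precisely when defunct shadow pages were queued
                   * on invalid_list, leaking them. Bail only when there is
                   * nothing to flush and nothing to zap.
                   */
                  if (!remote_flush && list_empty(invalid_list))
                          return false;

                  if (!list_empty(invalid_list))
                          kvm_mmu_commit_zap_page(kvm, invalid_list);
                  else
                          kvm_flush_remote_tlbs(kvm);
                  return true;
          }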
      
      Fixes: a2113634 ("KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 06 April 2019, 4 commits
  4. 05 April 2019, 3 commits
    • syscalls: Remove start and number from syscall_set_arguments() args · 32d92586
      Committed by Steven Rostedt (VMware)
      After removing the start and count arguments from syscall_get_arguments(),
      it seems reasonable to remove them from syscall_set_arguments() as well.
      Note that, as of today, there are no users of syscall_set_arguments(),
      but we are told there will be soon. For now, at least make it consistent
      with syscall_get_arguments().
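
      A sketch of the implied prototype change in the generic syscall.h
      (signature reconstructed from this description, not quoted from the
      patch):

          /* Before: callers passed a start index and a count. */
          void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
                                     unsigned int i, unsigned int n,
                                     const unsigned long *args);

          /* After: always sets the first six syscall arguments from args[0..5]. */
          void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
                                     const unsigned long *args);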
      
      Link: http://lkml.kernel.org/r/20190327222014.GA32540@altlinux.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • syscalls: Remove start and number from syscall_get_arguments() args · b35f549d
      Committed by Steven Rostedt (Red Hat)
      At Linux Plumbers, Andy Lutomirski approached me and pointed out that the
      function syscall_get_arguments() as implemented in x86 was horribly
      written and not optimized for the standard case of passing in 0 and 6 for
      the starting index and the number of arguments to get. When looking at
      all the users of this function, I discovered that every caller passes in
      only 0 and 6 for these arguments. Instead of having this function handle
      different cases that are never used, simply rewrite it to return the
      first 6 arguments of a system call.
      
      This should help out the performance of tracing system calls by ptrace,
      ftrace and perf.
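
      On x86-64 the simplified helper reduces to six register copies per the
      syscall ABI; a hedged sketch of the post-patch shape (details may
      differ from the upstream implementation):

          static inline void syscall_get_arguments(struct task_struct *task,
                                                   struct pt_regs *regs,
                                                   unsigned long *args)
          {
                  /* x86-64 syscall args live in rdi, rsi, rdx, r10, r8, r9. */
                  args[0] = regs->di;
                  args[1] = regs->si;
                  args[2] = regs->dx;
                  args[3] = regs->r10;
                  args[4] = regs->r8;
                  args[5] = regs->r9;
          }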
      
      Link: http://lkml.kernel.org/r/20161107213233.754809394@goodmis.org
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dominik Brodowski <linux@dominikbrodowski.net>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: "Dmitry V. Levin" <ldv@altlinux.org>
      Cc: x86@kernel.org
      Cc: linux-snps-arc@lists.infradead.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: uclinux-h8-devel@lists.sourceforge.jp
      Cc: linux-hexagon@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-mips@vger.kernel.org
      Cc: nios2-dev@lists.rocketboards.org
      Cc: openrisc@lists.librecores.org
      Cc: linux-parisc@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-riscv@lists.infradead.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: sparclinux@vger.kernel.org
      Cc: linux-um@lists.infradead.org
      Cc: linux-xtensa@linux-xtensa.org
      Cc: linux-arch@vger.kernel.org
      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
      Acked-by: Max Filippov <jcmvbkbc@gmail.com> # For xtensa changes
      Acked-by: Will Deacon <will.deacon@arm.com> # For the arm64 bits
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de> # for x86
      Reviewed-by: Dmitry V. Levin <ldv@altlinux.org>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • xen: Prevent buffer overflow in privcmd ioctl · 42d8644b
      Committed by Dan Carpenter
      The "call" variable comes from the user in privcmd_ioctl_hypercall().
      It's an offset into the hypercall_page[] which has (PAGE_SIZE / 32)
      elements.  We need to put an upper bound on it to prevent an out of
      bounds access.
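
      A minimal sketch of the kind of bounds check the fix adds in
      privcmd_call() (exact placement and error value follow the upstream
      patch, which this may not match verbatim):

          static inline long privcmd_call(unsigned int call, unsigned long a1,
                                          unsigned long a2, unsigned long a3,
                                          unsigned long a4, unsigned long a5)
          {
                  /*
                   * hypercall_page[] holds PAGE_SIZE / 32 entry stubs; reject
                   * user-supplied call numbers that index past the page.
                   */
                  if (call >= PAGE_SIZE / 32)
                          return -EINVAL;

                  /* ... jump into hypercall_page[call] as before ... */
          }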
      
      Cc: stable@vger.kernel.org
      Fixes: 1246ae0b ("xen: add variable hypercall caller")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Juergen Gross <jgross@suse.com>
  5. 29 March 2019, 8 commits
    • x86/realmode: Make set_real_mode_mem() static inline · f560bd19
      Committed by Matteo Croce
      Remove the unused @size argument from set_real_mode_mem() and move the
      function into a header file so it can be inlined.
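
      The resulting helper is tiny; a sketch of the inlined form, assuming
      the real_mode_header global that the realmode code already uses:

          static inline void set_real_mode_mem(phys_addr_t mem)
          {
                  /* The unused @size argument is simply dropped. */
                  real_mode_header = (struct real_mode_header *) __va(mem);
          }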
      
       [ bp: Massage. ]
      Signed-off-by: Matteo Croce <mcroce@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Mukesh Ojha <mojha@codeaurora.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-efi <linux-efi@vger.kernel.org>
      Cc: platform-driver-x86@vger.kernel.org
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190328114233.27835-1-mcroce@redhat.com
    • KVM: x86: update %rip after emulating IO · 45def77e
      Committed by Sean Christopherson
      Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
      OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
      vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
      that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
      
      To avoid corrupting %rip after such a reset, commit 0967b7bf ("KVM:
      Skip pio instruction when it is emulated, not executed") changed the
      behavior of PIO handlers, i.e. today's "fast" PIO handling, to skip the
      instruction prior to exiting to userspace.  Full emulation doesn't need
      such tricks because re-emulating the instruction will naturally handle
      %rip being changed to point at the reset vector.
      
      Updating %rip prior to exiting to userspace has several drawbacks:
      
        - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
          fails it will likely yell about the wrong address.
        - Single-step exits to userspace are effectively dropped as
          KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
        - Behavior of PIO emulation is different depending on whether it
          goes down the fast path or the slow path.
      
      Rather than skip the PIO instruction before exiting to userspace,
      snapshot the linear %rip and cancel PIO completion if the current
      value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
      common scenario, the snapshot and comparison have negligible overhead
      as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
      VMREAD in this case.
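
      A sketch of the completion-side check for the OUT case, built around
      the existing kvm_is_linear_rip() helper (surrounding names are
      reconstructed from this description, not the verbatim patch):

          static int complete_fast_pio_out(struct kvm_vcpu *vcpu)
          {
                  vcpu->arch.pio.count = 0;

                  /*
                   * Skip the PIO instruction only if %rip still points at it;
                   * a mismatch means userspace likely reset or redirected the
                   * vCPU, so the stale completion must be dropped.
                   */
                  if (unlikely(!kvm_is_linear_rip(vcpu, vcpu->arch.pio.linear_rip)))
                          return 1;

                  return kvm_skip_emulated_instruction(vcpu);
          }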
      
      All other alternatives to snapshotting the linear %rip that don't
      rely on an explicit reset announcement suffer from one corner case
      or another.  For example, canceling PIO completion on any write to
      %rip fails if userspace does a save/restore of %rip, and attempting to
      avoid that issue by canceling PIO only if %rip changed then fails if PIO
      collides with the reset %rip.  Attempting to zero in on the exact reset
      vector won't work for APs, which means adding more hooks such as the
      vCPU's MP_STATE, and so on and so forth.
      
      Checking for a linear %rip match technically suffers from corner cases,
      e.g. userspace could theoretically rewrite the underlying code page and
      expect a different instruction to execute, or the guest hardcodes a PIO
      reset at 0xfffffff0, but those are far, far outside of what can be
      considered normal operation.
      
      Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init · 013cc6eb
      Committed by Vitaly Kuznetsov
      When userspace initializes guest vCPUs it may want to zero all supported
      MSRs, including Hyper-V related ones such as HV_X64_MSR_STIMERn_CONFIG/
      HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5 ("kvm/x86: Update SynIC
      timers on guest entry only") we began doing stimer_mark_pending()
      unconditionally on every config change.
      
      The issue I'm observing manifests itself as follows:
      - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as
        pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER;
      - kvm_hv_has_stimer_pending() starts returning true;
      - kvm_vcpu_has_events() starts returning true;
      - kvm_arch_vcpu_runnable() starts returning true;
      - when kvm_arch_vcpu_ioctl_run() gets into
        (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case:
        - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns
          immediately, avoiding normal wait path;
        - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing
          userspace to retry.
      
      So instead of the normal wait path we get a busy loop on all secondary
      vCPUs before they get the INIT signal. This seems undesirable, especially
      given that this happens even when Hyper-V extensions are not used.
      
      Generally, it seems pointless to mark a stimer as pending in
      stimer_pending_bitmap and arm KVM_REQ_HV_STIMER when the only thing
      kvm_hv_process_stimers() will do is clear the corresponding bit. We can
      simply not mark disabled timers as pending instead.
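
      A sketch of that approach in stimer_set_config(), with field names
      treated as illustrative rather than quoted from the patch:

          static int stimer_set_config(struct kvm_vcpu_hv_stimer *stimer,
                                       u64 config, bool host)
          {
                  /* ... validate the new config value ... */
                  stimer_cleanup(stimer);
                  stimer->config.as_uint64 = config;

                  /*
                   * Only arm KVM_REQ_HV_STIMER for an enabled timer; marking a
                   * disabled timer pending just makes blocked vCPUs spuriously
                   * runnable until kvm_hv_process_stimers() clears the bit.
                   */
                  if (stimer->config.enable)
                          stimer_mark_pending(stimer, false);
                  return 0;
          }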
      
      Fixes: f3b138c5 ("kvm/x86: Update SynIC timers on guest entry only")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs · 2bdb76c0
      Committed by Xiaoyao Li
      Since MSR_IA32_ARCH_CAPABILITIES is emulated unconditionally even if the
      host doesn't support it, move it from the array msrs_to_save to the
      array emulated_msrs to report to userspace that the guest supports this
      MSR.
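
      The change amounts to moving one entry between the two MSR tables in
      arch/x86/kvm/x86.c; a sketch with the surrounding entries elided:

          static u32 msrs_to_save[] = {
                  /* ... MSR_IA32_ARCH_CAPABILITIES removed from this list ... */
          };

          static u32 emulated_msrs[] = {
                  /* ... */
                  MSR_IA32_ARCH_CAPABILITIES,
          };
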
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts · 0cf9135b
      Committed by Sean Christopherson
      The CPUID flag ARCH_CAPABILITIES is unconditionally exposed to host
      userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES
      regardless of hardware support under the pretense that KVM fully
      emulates MSR_IA32_ARCH_CAPABILITIES.  Unfortunately, only VMX hosts
      handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS
      also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts).
      
      Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so
      that it's emulated on AMD hosts.
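
      A sketch of the read side once the handling lives in common code,
      modeled on KVM's usual MSR-get pattern (treat the exact hunk as
      illustrative):

          /* arch/x86/kvm/x86.c: kvm_get_msr_common() */
          case MSR_IA32_ARCH_CAPABILITIES:
                  if (!msr_info->host_initiated &&
                      !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
                          return 1;
                  msr_info->data = vcpu->arch.arch_capabilities;
                  break;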
      
      Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported")
      Cc: stable@vger.kernel.org
      Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Use range-based flushing in slot_handle_level_range · f285c633
      Committed by Ben Gardon
      Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address
      in slot_handle_level_range. When range based flushes are not enabled
      kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs.
      
      This changes the behavior of many functions that indirectly use
      slot_handle_level_range, iff the range based flushes are enabled. The
      only potential problem I see with this is that kvm->tlbs_dirty will be
      cleared less often; however, the only caller of slot_handle_level_range
      that checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start, which
      checks it and does a kvm_flush_remote_tlbs after calling
      kvm_unmap_hva_range anyway.
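
      The substitution itself is mechanical; a sketch of the flush site in
      slot_handle_level_range(), where the iterator variable names are
      assumptions based on the description:

          if (flush && lock_flush_tlb) {
                  /* Flush only the GFN range this walk actually touched. */
                  kvm_flush_remote_tlbs_with_address(kvm, start_gfn,
                                  iterator.gfn - start_gfn + 1);
                  flush = false;
          }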
      
      Tested: Ran all kvm-unit-tests on an Intel Haswell machine with and
      	without this patch. The patch introduced no new failures.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() · 4d66623c
      Committed by Wei Yang
      * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages
        is non-zero.

      * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages()
        never returns zero.

      Based on these two observations, we can merge the two *if* clauses and
      use the return value from kvm_mmu_calculate_mmu_pages() directly. This
      simplifies the code and also eliminates the possibility for a reader to
      believe nr_mmu_pages could be zero.
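
      A sketch of the merged logic in kvm_arch_commit_memory_region(), using
      the helper name from this message:

          if (!kvm->arch.n_requested_mmu_pages)
                  kvm_mmu_change_mmu_pages(kvm,
                                  kvm_mmu_calculate_mmu_pages(kvm));
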
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields · 711eff3a
      Committed by Krish Sadhukhan
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check is performed on vmentry of L2 guests:
      
          On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP
          field and the IA32_SYSENTER_EIP field must each contain a canonical
          address.
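
      A sketch of the corresponding check in the nested host-state
      validation path (placement and surrounding code are assumptions):

          #ifdef CONFIG_X86_64
                  /* HOST_IA32_SYSENTER_{ESP,EIP} must be canonical on vmentry. */
                  if (is_noncanonical_address(vmcs12->host_ia32_sysenter_esp, vcpu) ||
                      is_noncanonical_address(vmcs12->host_ia32_sysenter_eip, vcpu))
                          return -EINVAL;
          #endif
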
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>