1. 08 5月, 2015 2 次提交
  2. 07 5月, 2015 12 次提交
    • P
      KVM: x86: dump VMCS on invalid entry · 4eb64dce
      Paolo Bonzini 提交于
      Code and format roughly based on Xen's vmcs_dump_vcpu.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4eb64dce
    • M
      x86: kvmclock: drop rdtsc_barrier() · a3eb97bd
      Marcelo Tosatti 提交于
      Drop unnecessary rdtsc_barrier(), as has been determined empirically,
      see 057e6a8c for details.
      
      Noticed by Andy Lutomirski.
      
      Improves clock_gettime() by approximately 15% on
      Intel i7-3520M @ 2.90GHz.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a3eb97bd
    • J
      KVM: x86: drop unneeded null test · d90e3a35
      Julia Lawall 提交于
      If the null test is needed, the call to cancel_delayed_work_sync would have
      already crashed.  Normally, the destroy function should only be called
      if the init function has succeeded, in which case ioapic is not null.
      
      Problem found using Coccinelle.
      Suggested-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJulia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d90e3a35
    • R
      KVM: x86: fix initial PAT value · 74545705
      Radim Krčmář 提交于
      PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
      VMX used a wrong value (host's PAT) and while SVM used the right one,
      it never got to arch.pat.
      
      This is not an issue with QEMU as it will force the correct value.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      74545705
    • R
      kvm,x86: load guest FPU context more eagerly · 653f52c3
      Rik van Riel 提交于
      Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
      re-load them every time the guest accesses the FPU after a switch back into
      the guest from the host.
      
      This patch copies the x86 task switch semantics for FPU loading, with the
      FPU loaded eagerly after first use if the system uses eager fpu mode,
      or if the guest uses the FPU frequently.
      
      In the latter case, after loading the FPU for 255 times, the fpu_counter
      will roll over, and we will revert to loading the FPU on demand, until
      it has been established that the guest is still actively using the FPU.
      
      This mirrors the x86 task switch policy, which seems to work.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      653f52c3
    • J
      kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true · d1ebdbf9
      James Sullivan 提交于
      An MSI interrupt should only be delivered to the lowest priority CPU
      when it has RH=1, regardless of the delivery mode. Modified
      kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
      or irq->msi_redir_hint.
      
      Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
      kvm_lowest_prio_delivery().
      
      Changed a check in kvm_irq_delivery_to_apic_fast() from
      irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
      Signed-off-by: NJames Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1ebdbf9
    • J
      kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery · 93bbf0b8
      James Sullivan 提交于
      Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
      be used to determine if the delivery of the MSI should target only
      the lowest priority CPU in the logical group specified for delivery.
      (In physical dest mode, the RH bit is not relevant). Initialized the value
      of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
      to false in all other cases.
      
      Added value of msi_redir_hint to a debug message dump of an IRQ in
      apic_send_ipi().
      Signed-off-by: NJames Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      93bbf0b8
    • P
      KVM: x86: tweak types of fields in kvm_lapic_irq · b7cb2231
      Paolo Bonzini 提交于
      Change to u16 if they only contain data in the low 16 bits.
      
      Change the level field to bool, since we assign 1 sometimes, but
      just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b7cb2231
    • N
      KVM: x86: INIT and reset sequences are different · d28bc9dd
      Nadav Amit 提交于
      x86 architecture defines differences between the reset and INIT sequences.
      INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
      MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.
      
      References (from Intel SDM):
      
      "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
      to a specific processor or system wide) do not cause the MP protocol to be
      repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
      
      [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
      
      "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
      changed." [9.2: X87 FPU INITIALIZATION]
      
      "The state of the local APIC following an INIT reset is the same as it is after
      a power-up or hardware reset, except that the APIC ID and arbitration ID
      registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
      ("Wait-for-SIPI" State)]
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d28bc9dd
    • N
      KVM: x86: Support for disabling quirks · 90de4a18
      Nadav Amit 提交于
      Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
      created in order to overcome QEMU issues. Those issue were mostly result of
      invalid VM BIOS.  Currently there are two quirks that can be disabled:
      
      1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
      2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
      
      These two issues are already resolved in recent releases of QEMU, and would
      therefore be disabled by QEMU.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
      [Report capability from KVM_CHECK_EXTENSION too. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      90de4a18
    • C
      KVM: arm/mips/x86/power use __kvm_guest_{enter|exit} · ccf73aaf
      Christian Borntraeger 提交于
      Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
      where interrupts are disabled.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ccf73aaf
    • L
      kvmclock: set scheduler clock stable · ff7bbb9c
      Luiz Capitulino 提交于
      If you try to enable NOHZ_FULL on a guest today, you'll get
      the following error when the guest tries to deactivate the
      scheduler tick:
      
       WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
       NO_HZ FULL will not work with unstable sched clock
       CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb7 #204
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       Workqueue: events flush_to_ldisc
        ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
        ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
        0000000000000000 0000000000000003 0000000000000001 0000000000000001
       Call Trace:
        <IRQ>  [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
        [<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
        [<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
        [<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
        [<ffffffff810511c5>] irq_exit+0xc5/0x130
        [<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
        [<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
        <EOI>  [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
        [<ffffffff8108bbc8>] __wake_up+0x48/0x60
        [<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
        [<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
        [<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
        [<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
        [<ffffffff81064d05>] process_one_work+0x1d5/0x540
        [<ffffffff81064c81>] ? process_one_work+0x151/0x540
        [<ffffffff81065191>] worker_thread+0x121/0x470
        [<ffffffff81065070>] ? process_one_work+0x540/0x540
        [<ffffffff8106b4df>] kthread+0xef/0x110
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
        [<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
       ---[ end trace 06e3507544a38866 ]---
      
      However, it turns out that kvmclock does provide a stable
      sched_clock callback. So, let the scheduler know this which
      in turn makes NOHZ_FULL work in the guest.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ff7bbb9c
  3. 27 4月, 2015 3 次提交
    • P
      x86: pvclock: Really remove the sched notifier for cross-cpu migrations · 73459e2a
      Paolo Bonzini 提交于
      This reverts commits 0a4e6be9
      and 80f7fdb1.
      
      The task migration notifier was originally introduced in order to support
      the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
      with synchronized TSC.  Hence, on KVM the race condition is only needed
      due to a bad implementation on the host side, and even then it's so rare
      that it's mostly theoretical.
      
      As far as KVM is concerned it's possible to fix the host, avoiding the
      additional complexity in the vDSO and the (re)introduction of the task
      migration notifier.
      
      Xen, on the other hand, hasn't yet implemented vsyscall support at
      all, so we do not care about its plans for non-synchronized TSC.
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Suggested-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      73459e2a
    • R
      kvm: x86: fix kvmclock update protocol · 5dca0d91
      Radim Krčmář 提交于
      The kvmclock spec says that the host will increment a version field to
      an odd number, then update stuff, then increment it to an even number.
      The host is buggy and doesn't do this, and the result is observable
      when one vcpu reads another vcpu's kvmclock data.
      
      There's no good way for a guest kernel to keep its vdso from reading
      a different vcpu's kvmclock data, but we don't need to care about
      changing VCPUs as long as we read a consistent data from kvmclock.
      (VCPU can change outside of this loop too, so it doesn't matter if we
      return a value not fit for this VCPU.)
      
      Based on a patch by Radim Krčmář.
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Acked-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5dca0d91
    • A
      x86_64, asm: Work around AMD SYSRET SS descriptor attribute issue · 61f01dd9
      Andy Lutomirski 提交于
      AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET with
      SS == 0 results in an invalid usermode state in which SS is apparently
      equal to __USER_DS but causes #SS if used.
      
      Work around the issue by setting SS to __KERNEL_DS __switch_to, thus
      ensuring that SYSRET never happens with SS set to NULL.
      
      This was exposed by a recent vDSO cleanup.
      
      Fixes: e7d6eefa x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f01dd9
  4. 24 4月, 2015 2 次提交
    • L
      x86: fix special __probe_kernel_write() tail zeroing case · d869844b
      Linus Torvalds 提交于
      Commit cae2a173 ("x86: clean up/fix 'copy_in_user()' tail zeroing")
      fixed the failure case tail zeroing of one special case of the x86-64
      generic user-copy routine, namely when used for the user-to-user case
      ("copy_in_user()").
      
      But in the process it broke an even more unusual case: using the user
      copy routine for kernel-to-kernel copying.
      
      Now, normally kernel-kernel copies are obviously done using memcpy(),
      but we have a couple of special cases when we use the user-copy
      functions.  One is when we pass a kernel buffer to a regular user-buffer
      routine, using set_fs(KERNEL_DS).  That's a "normal" case, and continued
      to work fine, because it never takes any faults (with the possible
      exception of a silent and successful vmalloc fault).
      
      But Jan Beulich pointed out another, very unusual, special case: when we
      use the user-copy routines not because it's a path that expects a user
      pointer, but for a couple of ftrace/kgdb cases that want to do a kernel
      copy, but do so using "unsafe" buffers, and use the user-copy routine to
      gracefully handle faults.  IOW, for probe_kernel_write().
      
      And that broke for the case of a faulting kernel destination, because we
      saw the kernel destination and wanted to try to clear the tail of the
      buffer.  Which doesn't work, since that's what faults.
      
      This only triggers for things like kgdb and ftrace users (eg trying
      setting a breakpoint on read-only memory), but it's definitely a bug.
      The fix is to not compare against the kernel address start (TASK_SIZE),
      but instead use the same limits "access_ok()" uses.
      Reported-and-tested-by: NJan Beulich <jbeulich@suse.com>
      Cc: stable@vger.kernel.org # 4.0
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d869844b
    • A
      crypto: x86/sha512_ssse3 - fixup for asm function prototype change · 00425bb1
      Ard Biesheuvel 提交于
      Patch e68410eb ("crypto: x86/sha512_ssse3 - move SHA-384/512
      SSSE3 implementation to base layer") changed the prototypes of the
      core asm SHA-512 implementations so that they are compatible with
      the prototype used by the base layer.
      
      However, in one instance, the register that was used for passing the
      input buffer was reused as a scratch register later on in the code,
      and since the input buffer param changed places with the digest param
      -which needs to be written back before the function returns- this
      resulted in the scratch register to be dereferenced in a memory write
      operation, causing a GPF.
      
      Fix this by changing the scratch register to use the same register as
      the input buffer param again.
      
      Fixes: e68410eb ("crypto: x86/sha512_ssse3 - move SHA-384/512 SSSE3 implementation to base layer")
      Reported-By: NBobby Powers <bobbypowers@gmail.com>
      Tested-By: NBobby Powers <bobbypowers@gmail.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      00425bb1
  5. 22 4月, 2015 1 次提交
    • B
      KVM: VMX: Preserve host CR4.MCE value while in guest mode. · 085e68ee
      Ben Serebrin 提交于
      The host's decision to enable machine check exceptions should remain
      in force during non-root mode.  KVM was writing 0 to cr4 on VCPU reset
      and passed a slightly-modified 0 to the vmcs.guest_cr4 value.
      
      Tested: Built.
      On earlier version, tested by injecting machine check
      while a guest is spinning.
      
      Before the change, if guest CR4.MCE==0, then the machine check is
      escalated to Catastrophic Error (CATERR) and the machine dies.
      If guest CR4.MCE==1, then the machine check causes VMEXIT and is
      handled normally by host Linux. After the change, injecting a machine
      check causes normal Linux machine check handling.
      Signed-off-by: NBen Serebrin <serebrin@google.com>
      Reviewed-by: NVenkatesh Srinivas <venkateshs@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      085e68ee
  6. 19 4月, 2015 1 次提交
  7. 18 4月, 2015 2 次提交
  8. 17 4月, 2015 6 次提交
  9. 16 4月, 2015 3 次提交
    • O
      x86/ptrace: Fix the TIF_FORCED_TF logic in handle_signal() · fd0f86b6
      Oleg Nesterov 提交于
      When the TIF_SINGLESTEP tracee dequeues a signal,
      handle_signal() clears TIF_FORCED_TF and X86_EFLAGS_TF but
      leaves TIF_SINGLESTEP set.
      
      If the tracer does PTRACE_SINGLESTEP again, enable_single_step()
      sets X86_EFLAGS_TF but not TIF_FORCED_TF.  This means that the
      subsequent PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the
      tracee gets the wrong SIGTRAP.
      
      Test-case (needs -O2 to avoid prologue insns in signal handler):
      
      	#include <unistd.h>
      	#include <stdio.h>
      	#include <sys/ptrace.h>
      	#include <sys/wait.h>
      	#include <sys/user.h>
      	#include <assert.h>
      	#include <stddef.h>
      
      	void handler(int n)
      	{
      		asm("nop");
      	}
      
      	int child(void)
      	{
      		assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      		signal(SIGALRM, handler);
      		kill(getpid(), SIGALRM);
      		return 0x23;
      	}
      
      	void *getip(int pid)
      	{
      		return (void*)ptrace(PTRACE_PEEKUSER, pid,
      					offsetof(struct user, regs.rip), 0);
      	}
      
      	int main(void)
      	{
      		int pid, status;
      
      		pid = fork();
      		if (!pid)
      			return child();
      
      		assert(wait(&status) == pid);
      		assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM);
      
      		assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
      		assert(wait(&status) == pid);
      		assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
      		assert((getip(pid) - (void*)handler) == 0);
      
      		assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
      		assert(wait(&status) == pid);
      		assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
      		assert((getip(pid) - (void*)handler) == 1);
      
      		assert(ptrace(PTRACE_CONT, pid, 0,0) == 0);
      		assert(wait(&status) == pid);
      		assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23);
      
      		return 0;
      	}
      
      The last assert() fails because PTRACE_CONT wrongly triggers
      another single-step and X86_EFLAGS_TF can't be cleared by
      debugger until the tracee does sys_rt_sigreturn().
      
      Change handle_signal() to do user_disable_single_step() if
      stepping, we do not need to preserve TIF_SINGLESTEP because we
      are going to do ptrace_notify(), and it is simply wrong to leak
      this bit.
      
      While at it, change the comment to explain why we also need to
      clear TF unconditionally after setup_rt_frame().
      
      Note: in the longer term we should probably change
      setup_sigcontext() to use get_flags() and then just remove this
      user_disable_single_step().  And, the state of TIF_FORCED_TF can
      be wrong after restore_sigcontext() which can set/clear TF, this
      needs another fix.
      
      This fix fixes the 'single_step_syscall_32' testcase in
      the x86 testsuite:
      
      Before:
      
      	~/linux/tools/testing/selftests/x86> ./single_step_syscall_32
      	[RUN]   Set TF and check nop
      	[OK]    Survived with TF set and 9 traps
      	[RUN]   Set TF and check int80
      	[OK]    Survived with TF set and 9 traps
      	[RUN]   Set TF and check a fast syscall
      	[WARN]  Hit 10000 SIGTRAPs with si_addr 0xf7789cc0, ip 0xf7789cc0
      	Trace/breakpoint trap (core dumped)
      
      After:
      
      	~/linux/linux/tools/testing/selftests/x86> ./single_step_syscall_32
      	[RUN]   Set TF and check nop
      	[OK]    Survived with TF set and 9 traps
      	[RUN]   Set TF and check int80
      	[OK]    Survived with TF set and 9 traps
      	[RUN]   Set TF and check a fast syscall
      	[OK]    Survived with TF set and 39 traps
      	[RUN]   Fast syscall with TF cleared
      	[OK]    Nothing unexpected happened
      Reported-by: NEvan Teran <eteran@alum.rit.edu>
      Reported-by: NPedro Alves <palves@redhat.com>
      Tested-by: NAndres Freund <andres@anarazel.de>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Added x86 self-test info. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      fd0f86b6
    • J
      x86: mtrr: if: remove use of seq_printf return value · 3ac62bc0
      Joe Perches 提交于
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ac62bc0
    • D
      VFS: assorted d_backing_inode() annotations · bb668734
      David Howells 提交于
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      bb668734
  10. 15 4月, 2015 8 次提交
    • X
      KVM: MMU: fix comment in kvm_mmu_zap_collapsible_spte · decf6333
      Xiao Guangrong 提交于
      Soft mmu uses direct shadow page to fill guest large mapping with small
      pages if huge mapping is disallowed on host. So zapping direct shadow
      page works well both for soft mmu and hard mmu, it's just less widely
      applicable.
      
      Fix the comment to reflect this.
      Signed-off-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Message-Id: <552C91BA.1010703@linux.intel.com>
      [Fix comment wording further. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      decf6333
    • W
      kvm: mmu: don't do memslot overflow check · 13000523
      Wanpeng Li 提交于
      As Andres pointed out:
      
      | I don't understand the value of this check here. Are we looking for a
      | broken memslot? Shouldn't this be a BUG_ON? Is this the place to care
      | about these things? npages is capped to KVM_MEM_MAX_NR_PAGES, i.e.
      | 2^31. A 64 bit overflow would be caused by a gigantic gfn_start which
      | would be trouble in many other ways.
      
      This patch drops the memslot overflow check to make the codes more simple.
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Message-Id: <1429064694-3072-1-git-send-email-wanpeng.li@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      13000523
    • V
      mm: move memtest under mm · 4a20799d
      Vladimir Murzin 提交于
      Memtest is a simple feature which fills the memory with a given set of
      patterns and validates memory contents, if bad memory regions is detected
      it reserves them via memblock API.  Since memblock API is widely used by
      other architectures this feature can be enabled outside of x86 world.
      
      This patch set promotes memtest to live under generic mm umbrella and
      enables memtest feature for arm/arm64.
      
      It was reported that this patch set was useful for tracking down an issue
      with some errant DMA on an arm64 platform.
      
      This patch (of 6):
      
      There is nothing platform dependent in the core memtest code, so other
      platforms might benefit from this feature too.
      
      [linux@roeck-us.net: MEMTEST depends on MEMBLOCK]
      Signed-off-by: NVladimir Murzin <vladimir.murzin@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a20799d
    • K
      mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE · 204db6ed
      Kees Cook 提交于
      The arch_randomize_brk() function is used on several architectures,
      even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
      tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
      the architectures that support it, while still handling CONFIG_COMPAT_BRK.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      204db6ed
    • K
      mm: split ET_DYN ASLR from mmap ASLR · d1fd836d
      Kees Cook 提交于
      This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
      powerpc, and x86.  The problem is that if there is a leak of ASLR from
      the executable (ET_DYN), it means a leak of shared library offset as
      well (mmap), and vice versa.  Further details and a PoC of this attack
      is available here:
      
        http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html
      
      With this patch, a PIE linked executable (ET_DYN) has its own ASLR
      region:
      
        $ ./show_mmaps_pie
        54859ccd6000-54859ccd7000 r-xp  ...  /tmp/show_mmaps_pie
        54859ced6000-54859ced7000 r--p  ...  /tmp/show_mmaps_pie
        54859ced7000-54859ced8000 rw-p  ...  /tmp/show_mmaps_pie
        7f75be764000-7f75be91f000 r-xp  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75be91f000-7f75beb1f000 ---p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb1f000-7f75beb23000 r--p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb23000-7f75beb25000 rw-p  ...  /lib/x86_64-linux-gnu/libc.so.6
        7f75beb25000-7f75beb2a000 rw-p  ...
        7f75beb2a000-7f75beb4d000 r-xp  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed45000-7f75bed46000 rw-p  ...
        7f75bed46000-7f75bed47000 r-xp  ...
        7f75bed47000-7f75bed4c000 rw-p  ...
        7f75bed4c000-7f75bed4d000 r--p  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed4d000-7f75bed4e000 rw-p  ...  /lib64/ld-linux-x86-64.so.2
        7f75bed4e000-7f75bed4f000 rw-p  ...
        7fffb3741000-7fffb3762000 rw-p  ...  [stack]
        7fffb377b000-7fffb377d000 r--p  ...  [vvar]
        7fffb377d000-7fffb377f000 r-xp  ...  [vdso]
      
      The change is to add a call the newly created arch_mmap_rnd() into the
      ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR,
      as was already done on s390.  Removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE,
      which is no longer needed.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1fd836d
    • K
      mm: expose arch_mmap_rnd when available · 2b68f6ca
      Kees Cook 提交于
      When an architecture fully supports randomizing the ELF load location,
      a per-arch mmap_rnd() function is used to find a randomized mmap base.
      In preparation for randomizing the location of ET_DYN binaries
      separately from mmap, this renames and exports these functions as
      arch_mmap_rnd(). Additionally introduces CONFIG_ARCH_HAS_ELF_RANDOMIZE
      for describing this feature on architectures that support it
      (which is a superset of ARCH_BINFMT_ELF_RANDOMIZE_PIE, since s390
      already supports a separated ET_DYN ASLR from mmap ASLR without the
      ARCH_BINFMT_ELF_RANDOMIZE_PIE logic).
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Russell King <linux@arm.linux.org.uk>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "David A. Long" <dave.long@linaro.org>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Arun Chandran <achandran@mvista.com>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Min-Hua Chen <orca.chen@gmail.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: Alex Smith <alex@alex-smith.me.uk>
      Cc: Markos Chandras <markos.chandras@imgtec.com>
      Cc: Vineeth Vijayan <vvijayan@mvista.com>
      Cc: Jeff Bailey <jeffbailey@google.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Behan Webster <behanw@converseincode.com>
      Cc: Ismael Ripoll <iripoll@upv.es>
      Cc: Jan-Simon Mller <dl9pf@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b68f6ca
    • K
      x86: standardize mmap_rnd() usage · 82168140
      Kees Cook 提交于
      In preparation for splitting out ET_DYN ASLR, this refactors the use of
      mmap_rnd() to be used similarly to arm, and extracts the checking of
      PF_RANDOMIZE.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82168140
    • T
      x86, mm: support huge KVA mappings on x86 · 6b637835
      Toshi Kani 提交于
      Implement huge KVA mapping interfaces on x86.
      
      On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
      using a huge page, MTRRs can override the memory type of the huge page,
      which may lead a performance penalty.  The processor can also behave in an
      undefined manner if a huge page is mapped to a memory range that MTRRs
      have mapped with multiple different memory types.  Therefore, the mapping
      code falls back to use a smaller page size toward 4KB when a mapping range
      is covered by non-WB type of MTRRs.  The WB type of MTRRs has no affect on
      the PAT memory types.
      
      pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see if a
      given range is covered by MTRRs.  MTRR_TYPE_WRBACK indicates that the
      range is either covered by WB or not covered and the MTRR default value is
      set to WB.  0xFF indicates that MTRRs are disabled.
      
      HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE is set.
       X86_32 without X86_PAE is not supported since such config can unlikey be
      benefited from this feature, and there was an issue found in testing.
      
      [fengguang.wu@intel.com: ioremap_pud_capable can be static]
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Robert Elliott <Elliott@hp.com>
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b637835