1. 26 January 2019, 1 commit
  2. 13 January 2019, 1 commit
  3. 10 January 2019, 1 commit
    • KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup · edcf33b1
      Committed by Sean Christopherson
      commit e81434995081fd7efb755fd75576b35dbb0850b1 upstream.
      
      ____kvm_handle_fault_on_reboot() provides a generic exception fixup
      handler that is used to cleanly handle faults on VMX/SVM instructions
      during reboot (or at least try to).  If there isn't a reboot in
      progress, ____kvm_handle_fault_on_reboot() treats any exception as
      fatal to KVM and invokes kvm_spurious_fault(), which in turn generates
      a BUG() to get a stack trace and die.
      
      When it was originally added by commit 4ecac3fd ("KVM: Handle
      virtualization instruction #UD faults during reboot"), the "call" to
      kvm_spurious_fault() was handcoded as PUSH+JMP, where the PUSH'd value
      is the RIP of the faulting instruction.
      
      The PUSH+JMP trickery is necessary because the exception fixup handler
      code lies outside of its associated function, e.g. right after the
      function.  An actual CALL from the .fixup code would show a slightly
      bogus stack trace, e.g. an extra "random" function would be inserted
      into the trace, as the return RIP on the stack would point to no known
      function (and the unwinder will likely try to guess who owns the RIP).
      
      Unfortunately, the JMP was replaced with a CALL when the macro was
      reworked to not spin indefinitely during reboot (commit b7c4145b
      "KVM: Don't spin on virt instruction faults during reboot").  This
      causes the aforementioned behavior where a bogus function is inserted
      into the stack trace, e.g. my builds like to blame free_kvm_area().
      
      Revert the CALL back to a JMP.  The changelog for commit b7c4145b
      ("KVM: Don't spin on virt instruction faults during reboot") contains
      nothing that indicates the switch to CALL was deliberate.  This is
      backed up by the fact that the PUSH <insn RIP> was left intact.
      
      Note that an alternative to the PUSH+JMP magic would be to JMP back
      to the "real" code and CALL from there, but that would require adding
      a JMP in the non-faulting path to avoid calling kvm_spurious_fault()
      and would add no value, i.e. the stack trace would be the same.
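      
      For reference, below is a hedged sketch of the fixup macro shape this
      changelog describes; the label numbers, the kvm_rebooting check and the
      _ASM_EXTABLE wiring are recalled from the kernel tree, not quoted from
      this text.
      
      /*
       * Hedged sketch, not the verbatim kernel macro: the .fixup code pushes
       * the RIP of the faulting instruction and then JMPs (not CALLs) to
       * kvm_spurious_fault(), so the unwinder attributes the BUG() to the
       * faulting instruction instead of inventing a bogus frame.
       */
      #define ____kvm_handle_fault_on_reboot(insn, cleanup_insn)        \
              "666: " insn "\n\t"                                        \
              "668: \n\t"                                                \
              ".pushsection .fixup, \"ax\" \n\t"                         \
              "667: \n\t"                                                \
              cleanup_insn "\n\t"                                        \
              "cmpb $0, kvm_rebooting \n\t"  /* reboot in progress?   */ \
              "jne 668b \n\t"                /* yes: resume silently  */ \
              "push $666b \n\t"              /* fake return RIP       */ \
              "jmp kvm_spurious_fault \n\t"  /* JMP, not CALL         */ \
              ".popsection \n\t"                                         \
              _ASM_EXTABLE(666b, 667b)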
      
      Using CALL:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 4 PID: 1057 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc900004bbcc8 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888273fd8000 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000371fb0
      R13: 0000000000000000 R14: 000000026d763cf4 R15: ffff888273fd8000
      FS:  00007f3d69691700(0000) GS:ffff888277800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000055f89bc56fe0 CR3: 0000000271a5a001 CR4: 0000000000362ee0
      Call Trace:
       free_kvm_area+0x1044/0x43ea [kvm_intel]
       ? vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace 9775b14b123b1713 ]---
      
      Using JMP:
      
      ------------[ cut here ]------------
      kernel BUG at /home/sean/go/src/kernel.org/linux/arch/x86/kvm/x86.c:356!
      invalid opcode: 0000 [#1] SMP
      CPU: 6 PID: 1067 Comm: qemu-system-x86 Not tainted 4.20.0-rc6+ #75
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      RIP: 0010:kvm_spurious_fault+0x5/0x10 [kvm]
      Code: <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 49 89 fd 41
      RSP: 0018:ffffc90000497cd0 EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffffffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88827058bd40 R08: 00000000000003e8 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000784 R12: ffffc90000369fb0
      R13: 0000000000000000 R14: 00000003c8fc6642 R15: ffff88827058bd40
      FS:  00007f3d7219e700(0000) GS:ffff888277900000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f3d64001000 CR3: 0000000271c6b004 CR4: 0000000000362ee0
      Call Trace:
       vmx_vcpu_run+0x156/0x630 [kvm_intel]
       ? kvm_arch_vcpu_ioctl_run+0x447/0x1a40 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? kvm_vcpu_ioctl+0x368/0x5c0 [kvm]
       ? __set_task_blocked+0x38/0x90
       ? __set_current_blocked+0x50/0x60
       ? __fpu__restore_sig+0x97/0x490
       ? do_vfs_ioctl+0xa1/0x620
       ? __x64_sys_futex+0x89/0x180
       ? ksys_ioctl+0x66/0x70
       ? __x64_sys_ioctl+0x16/0x20
       ? do_syscall_64+0x4f/0x100
       ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Modules linked in: vhost_net vhost tap kvm_intel kvm irqbypass bridge stp llc
      ---[ end trace f9daedb85ab3ddba ]---
      
      Fixes: b7c4145b ("KVM: Don't spin on virt instruction faults during reboot")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      edcf33b1
  4. 29 December 2018, 1 commit
  5. 21 December 2018, 1 commit
    • locking/qspinlock, x86: Provide liveness guarantee · 26586875
      Committed by Peter Zijlstra
      commit 7aa54be2 upstream.
      
      On x86 we cannot do fetch_or() with a single instruction and thus end up
      using a cmpxchg loop, which reduces determinism. Replace the fetch_or()
      with a composite operation: tas-pending + load.
      
      Using two instructions of course opens a window we previously did not
      have. Consider the scenario:
      
      	CPU0		CPU1		CPU2
      
       1)	lock
      	  trylock -> (0,0,1)
      
       2)			lock
      			  trylock /* fail */
      
       3)	unlock -> (0,0,0)
      
       4)					lock
      					  trylock -> (0,0,1)
      
       5)			  tas-pending -> (0,1,1)
      			  load-val <- (0,1,0) from 3
      
       6)			  clear-pending-set-locked -> (0,0,1)
      
      			  FAIL: _2_ owners
      
      where 5) is our new composite operation. When we consider each part of
      the qspinlock state as a separate variable (as we can when
      _Q_PENDING_BITS == 8) then the above is entirely possible, because
      tas-pending will only RmW the pending byte, so the later load is able
      to observe prior tail and lock state (but not earlier than its own
      trylock, which operates on the whole word, due to coherence).
      
      To avoid this we need 2 things:
      
       - the load must come after the tas-pending (obviously, otherwise it
         can trivially observe prior state).
      
       - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
         example, such that we cannot observe other state prior to setting
         pending.
      
      On x86 we can realize this by using "LOCK BTS m32, r32" for
      tas-pending followed by a regular load.
      
      Note that observing later state is not a problem:
      
       - if we fail to observe a later unlock, we'll simply spin-wait for
         that store to become visible.
      
       - if we observe a later xchg_tail(), there is no difference from that
         xchg_tail() having taken place before the tas-pending.
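      
      For illustration, here is a hedged sketch of such a composite operation
      as plain inline asm; this is not the kernel's GEN_BINARY_RMWcc-based
      code, and the pending bit is assumed to be bit 8 (_Q_PENDING_OFFSET).
      
      /*
       * Hedged sketch of the tas-pending + load composite described above.
       * LOCK BTS is a full-word RmW on the lock word, and the plain load
       * that follows is ordered after it, so it cannot observe state older
       * than our own trylock.
       */
      static inline unsigned int fetch_set_pending_acquire(unsigned int *lockword)
      {
              unsigned char old_pending;
              unsigned int val;
      
              asm volatile("lock btsl %2, %0"
                           : "+m" (*lockword), "=@ccc" (old_pending)
                           : "Ir" (8)
                           : "memory");
      
              val = *(volatile unsigned int *)lockword;   /* load after BTS */
      
              /* Pre-BTS pending state plus the freshly loaded tail/lock bits. */
              return (val & ~(1u << 8)) | ((unsigned int)old_pending << 8);
      }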
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: andrea.parri@amarulasolutions.com
      Cc: longman@redhat.com
      Fixes: 59fb586b ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
      Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      [bigeasy: GEN_BINARY_RMWcc macro redo]
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      26586875
  6. 06 12月, 2018 14 次提交
    • KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset · 76c8476c
      Committed by Leonid Shatz
      commit 326e7425 upstream.
      
      Since commit e79f245d ("X86/KVM: Properly update 'tsc_offset' to
      represent the running guest"), the meaning of vcpu->arch.tsc_offset was
      changed to always reflect the tsc_offset value set on the active VMCS,
      regardless of whether the vCPU is currently running L1 or L2.
      
      However, the above mentioned commit failed to also change
      kvm_vcpu_write_tsc_offset() to set vcpu->arch.tsc_offset correctly.
      vmx_write_tsc_offset() may set the tsc_offset value in the active VMCS
      to the given offset parameter *plus vmcs12->tsc_offset*, whereas
      kvm_vcpu_write_tsc_offset() just sets vcpu->arch.tsc_offset to the given
      offset parameter, without taking into account the possible addition of
      vmcs12->tsc_offset. (The same is true for the SVM case.)
      
      Fix this issue by changing kvm_x86_ops->write_tsc_offset() to return
      actually set tsc_offset in active VMCS and modify
      kvm_vcpu_write_tsc_offset() to set returned value in
      vcpu->arch.tsc_offset.
      In addition, rename write_tsc_offset() callback to write_l1_tsc_offset()
      to make it clear that it is meant to set L1 TSC offset.
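      
      A hedged sketch of the resulting shape on the VMX side; the guard
      condition is simplified and the helper names follow my recollection of
      the kernel tree rather than this text.
      
      /*
       * Hedged sketch: the callback returns the offset it actually wrote
       * into the active VMCS, and the generic code stores that return value
       * in vcpu->arch.tsc_offset instead of the raw L1 offset.
       */
      static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
      {
              u64 active_offset = offset;
      
              if (is_guest_mode(vcpu))        /* L2 active: add L1's offset */
                      active_offset += get_vmcs12(vcpu)->tsc_offset;
      
              vmcs_write64(TSC_OFFSET, active_offset);
              return active_offset;
      }
      
      /* in kvm_vcpu_write_tsc_offset():
       *      vcpu->arch.tsc_offset = kvm_x86_ops->write_l1_tsc_offset(vcpu, offset);
       */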
      
      Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Leonid Shatz <leonid.shatz@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      76c8476c
    • x86/speculation: Add seccomp Spectre v2 user space protection mode · d1ec2354
      Committed by Thomas Gleixner
      commit 6b3e64c237c072797a9ec918654a60e3a46488e2 upstream
      
      If 'prctl' mode of user space protection from spectre v2 is selected
      on the kernel command-line, STIBP and IBPB are applied on tasks which
      restrict their indirect branch speculation via prctl.
      
      SECCOMP enables the SSBD mitigation for sandboxed tasks already, so it
      makes sense to prevent spectre v2 user space to user space attacks as
      well.
      
      The Intel mitigation guide documents how STIBP works:
          
         Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
         prevents the predicted targets of indirect branches on any logical
         processor of that core from being controlled by software that executes
         (or executed previously) on another logical processor of the same core.
      
      Ergo setting STIBP protects the task itself from being attacked from a task
      running on a different hyper-thread and protects the tasks running on
      different hyper-threads from being attacked.
      
      While the document suggests that the branch predictors are shielded between
      the logical processors, the observed performance regressions suggest that
      STIBP simply disables the branch predictor more or less completely. Of
      course the document wording is vague, but the fact that there is also no
      requirement for issuing IBPB when STIBP is used points clearly in that
      direction. The kernel still issues IBPB even when STIBP is used until Intel
      clarifies the whole mechanism.
      
      IBPB is issued when the task switches out, so malicious sandbox code cannot
      mistrain the branch predictor for the next user space task on the same
      logical processor.
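      
      A hedged sketch of the seccomp hook this mode relies on; the function
      and enum names are recalled/assumed rather than quoted from this text.
      
      /*
       * Hedged sketch: when a task enters seccomp and the 'seccomp' mode is
       * selected, force-disable indirect branch speculation for it, mirroring
       * what is already done for SSBD.
       */
      void arch_seccomp_spec_mitigate(struct task_struct *task)
      {
              if (ssb_mode == SPEC_STORE_BYPASS_SECCOMP)
                      ssb_prctl_set(task, PR_SPEC_FORCE_DISABLE);
              if (spectre_v2_user == SPECTRE_V2_USER_SECCOMP)
                      ib_prctl_set(task, PR_SPEC_FORCE_DISABLE);
      }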
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185006.051663132@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d1ec2354
    • x86/speculation: Add prctl() control for indirect branch speculation · 238ba6e7
      Committed by Thomas Gleixner
      commit 9137bb27 upstream
      
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
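      
      A minimal user-space example assembled from the invocations listed
      above; the PR_* constants come from the uapi prctl header on kernels
      carrying this series, and error handling is reduced to perror().
      
      #include <stdio.h>
      #include <sys/prctl.h>
      #include <linux/prctl.h>
      
      int main(void)
      {
              /* Query the current indirect branch speculation state. */
              int state = prctl(PR_GET_SPECULATION_CTRL,
                                PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
              if (state < 0)
                      perror("PR_GET_SPECULATION_CTRL");
              else
                      printf("indirect branch speculation state: 0x%x\n", state);
      
              /* Opt this task out of indirect branch speculation (STIBP/IBPB). */
              if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                        PR_SPEC_DISABLE, 0, 0) < 0)
                      perror("PR_SET_SPECULATION_CTRL");
      
              return 0;
      }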
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      238ba6e7
    • x86/speculation: Prevent stale SPEC_CTRL msr content · e8412401
      Committed by Thomas Gleixner
      commit 6d991ba509ebcfcc908e009d1db51972a4f7a064 upstream
      
      The seccomp speculation control operates on all tasks of a process, but
      only the current task of a process can update the MSR immediately. For the
      other threads the update is deferred to the next context switch.
      
      This creates the following situation with Process A and B:
      
      Process A task 2 and Process B task 1 are pinned on CPU1. Process A task 2
      does not have the speculation control TIF bit set. Process B task 1 has the
      speculation control TIF bit set.
      
      CPU0					CPU1
      					MSR bit is set
      					ProcB.T1 schedules out
      					ProcA.T2 schedules in
      					MSR bit is cleared
      ProcA.T1
        seccomp_update()
        set TIF bit on ProcA.T2
      					ProcB.T1 schedules in
      					MSR is not updated  <-- FAIL
      
      This happens because the context switch code tries to avoid the MSR update
      if the speculation control TIF bits of the incoming and the outgoing task
      are the same. In the worst case ProcB.T1 and ProcA.T2 are the only tasks
      scheduling back and forth on CPU1, which keeps the MSR stale forever.
      
      In theory this could be remedied by IPIs, but chasing the remote task which
      could be migrated is complex and full of races.
      
      The straightforward solution is to avoid the asynchronous update of the
      TIF bit and defer it to the next context switch. The speculation control state
      is stored in task_struct::atomic_flags by the prctl and seccomp updates
      already.
      
      Add a new TIF_SPEC_FORCE_UPDATE bit and set this after updating the
      atomic_flags. Check the bit on context switch and force a synchronous
      update of the speculation control if set. Use the same mechanism for
      updating the current task.
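      
      A hedged sketch of the deferred update; the helper names follow the
      description above and my recollection, not a verbatim quote of the patch.
      
      /*
       * Hedged sketch: on context switch, if the incoming task was flagged
       * for a forced update, re-derive its TIF speculation bits from the
       * prctl/seccomp state in task_struct::atomic_flags before the MSR
       * value is computed.
       */
      static unsigned long speculation_ctrl_update_tif(struct task_struct *tsk)
      {
              if (test_and_clear_tsk_thread_flag(tsk, TIF_SPEC_FORCE_UPDATE)) {
                      if (task_spec_ssb_disable(tsk))
                              set_tsk_thread_flag(tsk, TIF_SSBD);
                      else
                              clear_tsk_thread_flag(tsk, TIF_SSBD);
      
                      if (task_spec_ib_disable(tsk))
                              set_tsk_thread_flag(tsk, TIF_SPEC_IB);
                      else
                              clear_tsk_thread_flag(tsk, TIF_SPEC_IB);
              }
              /* Return the possibly updated flags for the MSR computation. */
              return task_thread_info(tsk)->flags;
      }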
      Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1811272247140.1875@nanos.tec.linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8412401
    • x86/speculation: Prepare for conditional IBPB in switch_mm() · a1788815
      Committed by Thomas Gleixner
      commit 4c71a2b6fd7e42814aa68a6dec88abf3b42ea573 upstream
      
      The IBPB speculation barrier is issued from switch_mm() when the kernel
      switches to a user space task with a different mm than the user space task
      which ran last on the same CPU.
      
      An additional optimization is to avoid IBPB when the incoming task can be
      ptraced by the outgoing task. This optimization only works when switching
      directly between two user space tasks. When switching from a kernel task to
      a user space task the optimization fails because the previous task cannot
      be accessed anymore. So for quite some scenarios the optimization is just
      adding overhead.
      
      The upcoming conditional IBPB support will issue IBPB only for user space
      tasks which have the TIF_SPEC_IB bit set. This requires to handle the
      following cases:
      
        1) Switch from a user space task (potential attacker) which has
           TIF_SPEC_IB set to a user space task (potential victim) which has
           TIF_SPEC_IB not set.
      
        2) Switch from a user space task (potential attacker) which has
           TIF_SPEC_IB not set to a user space task (potential victim) which has
           TIF_SPEC_IB set.
      
      This needs to be optimized for the case where the IBPB can be avoided when
      only kernel threads ran in between user space tasks which belong to the
      same process.
      
      The current check whether two tasks belong to the same context uses the
      tasks' context id. While correct, it's simpler to use the mm pointer because
      it allows mangling the TIF_SPEC_IB bit into it. The context id based
      mechanism requires extra storage, which creates worse code.
      
      When a task is scheduled out its TIF_SPEC_IB bit is mangled as bit 0 into
      the per CPU storage which is used to track the last user space mm which was
      running on a CPU. This bit can be used together with the TIF_SPEC_IB bit of
      the incoming task to make the decision whether IBPB needs to be issued or
      not to cover the two cases above.
      
      As conditional IBPB is going to be the default, remove the dubious ptrace
      check for the IBPB always case and simply issue IBPB always when the
      process changes.
      
      Move the storage to a different place in the struct as the original one
      created a hole.
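      
      A hedged sketch of the mm pointer mangling described above; the constant
      and helper names are recalled/assumed, not quoted from this text.
      
      /* Bit 0 of the saved mm pointer carries the outgoing task's TIF_SPEC_IB. */
      #define LAST_USER_MM_IBPB       0x1UL
      
      static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
      {
              unsigned long next_tif = task_thread_info(next)->flags;
              unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
      
              return (unsigned long)next->mm | ibpb;
      }
      
      /*
       * Comparing this mangled value against the stored per-CPU value in
       * switch_mm() answers both questions at once: did the user mm change,
       * and did either side request IBPB protection?
       */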
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.466447057@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a1788815
    • x86/speculation: Avoid __switch_to_xtra() calls · dd73e15e
      Committed by Thomas Gleixner
      commit 5635d999 upstream
      
      The TIF_SPEC_IB bit does not need to be evaluated in the decision to invoke
      __switch_to_xtra() when:
      
       - CONFIG_SMP is disabled
      
       - The conditional STIBP mode is disabled
      
      The TIF_SPEC_IB bit still controls IBPB in both cases so the TIF work mask
      checks might invoke __switch_to_xtra() for nothing if TIF_SPEC_IB is the
      only set bit in the work masks.
      
      Optimize it out by masking the bit at compile time for CONFIG_SMP=n and at
      run time when the static key controlling the conditional STIBP mode is
      disabled.
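      
      A hedged sketch of the masking; the static key and mask names are
      recalled/assumed, and the inline wrapper itself is what the
      switch_to_xtra() consolidation entry below introduces.
      
      /*
       * Hedged sketch: drop TIF_SPEC_IB from the work masks when it cannot
       * matter, so a lone TIF_SPEC_IB difference no longer forces a trip
       * through __switch_to_xtra().
       */
      static __always_inline void switch_to_extra(struct task_struct *prev,
                                                  struct task_struct *next)
      {
              unsigned long next_tif = task_thread_info(next)->flags;
              unsigned long prev_tif = task_thread_info(prev)->flags;
      
              if (IS_ENABLED(CONFIG_SMP)) {
                      /*
                       * Conditional STIBP disabled: TIF_SPEC_IB only matters
                       * for IBPB, which is handled in switch_mm(), not here.
                       */
                      if (!static_branch_likely(&switch_to_cond_stibp)) {
                              prev_tif &= ~_TIF_SPEC_IB;
                              next_tif &= ~_TIF_SPEC_IB;
                      }
              }
      
              if (unlikely((next_tif | prev_tif) & _TIF_WORK_CTXSW))
                      __switch_to_xtra(prev, next);
      }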
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.374062201@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd73e15e
    • x86/process: Consolidate and simplify switch_to_xtra() code · a87c81f0
      Committed by Thomas Gleixner
      commit ff16701a upstream
      
      Move the conditional invocation of __switch_to_xtra() into an inline
      function so the logic can be shared between 32 and 64 bit.
      
      Remove the hand-through of the TSS pointer and retrieve the pointer directly
      in the bitmap handling function. Use this_cpu_ptr() instead of the
      per_cpu() indirection.
      
      This is a preparatory change so integration of conditional indirect branch
      speculation optimization happens only in one place.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.280855518@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a87c81f0
    • x86/speculation: Prepare for per task indirect branch speculation control · 69985a2c
      Committed by Tim Chen
      commit 5bfbe3ad upstream
      
      To avoid the overhead of STIBP always on, it's necessary to allow per task
      control of STIBP.
      
      Add a new task flag TIF_SPEC_IB and evaluate it during context switch if
      SMT is active and flag evaluation is enabled by the speculation control
      code. Add the conditional evaluation to x86_virt_spec_ctrl() as well so the
      guest/host switch works properly.
      
      This has no effect because TIF_SPEC_IB cannot be set yet and the static key
      which controls evaluation is off. Preparatory patch for adding the control
      code.
      
      [ tglx: Simplify the context switch logic and make the TIF evaluation
        	depend on SMP=y and on the static key controlling the conditional
        	update. Rename it to TIF_SPEC_IB because it controls both STIBP and
        	IBPB ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.176917199@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69985a2c
    • x86/speculation: Add command line control for indirect branch speculation · 71187543
      Committed by Thomas Gleixner
      commit fa1202ef224391b6f5b26cdd44cc50495e8fab54 upstream
      
      Add command line control for user space indirect branch speculation
      mitigations. The new option is: spectre_v2_user=
      
      The initial options are:
      
          -   on:   Unconditionally enabled
          -  off:   Unconditionally disabled
          - auto:   Kernel selects mitigation (default off for now)
      
      When the spectre_v2= command line argument is either 'on' or 'off' this
      implies that the application to application control follows that state even
      if a contradicting spectre_v2_user= argument is supplied.
      Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.082720373@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      71187543
    • x86/speculation: Rename SSBD update functions · 39402a5e
      Committed by Thomas Gleixner
      commit 26c4d75b234040c11728a8acb796b3a85ba7507c upstream
      
      During context switch, the SSBD bit in SPEC_CTRL MSR is updated according
      to changes of the TIF_SSBD flag in the current and next running task.
      
      Currently, only the bit controlling speculative store bypass disable in
      SPEC_CTRL MSR is updated and the related update functions all have
      "speculative_store" or "ssb" in their names.
      
      For enhanced mitigation control other bits in SPEC_CTRL MSR need to be
      updated as well, which makes the SSB names inadequate.
      
      Rename the "speculative_store*" functions to a more generic name. No
      functional change.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.058866968@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      39402a5e
    • x86/speculation: Update the TIF_SSBD comment · e8494e5d
      Committed by Tim Chen
      commit 8eb729b77faf83ac4c1f363a9ad68d042415f24c upstream
      
      "Reduced Data Speculation" is an obsolete term. The correct new name is
      "Speculative store bypass disable" - which is abbreviated into SSBD.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185003.593893901@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e8494e5d
    • x86/retpoline: Remove minimal retpoline support · 90d2c53f
      Committed by Zhenzhong Duan
      commit ef014aae upstream
      
      Now that CONFIG_RETPOLINE hard depends on compiler support, there is no
      reason to keep the minimal retpoline support around which only provided
      basic protection in the assembly files.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: <srinivas.eeda@oracle.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/f06f0a89-5587-45db-8ed2-0a9d6638d5c0@default
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      90d2c53f
    • x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support · 8c4ad5d3
      Committed by Zhenzhong Duan
      commit 4cd24de3 upstream
      
      Since retpoline capable compilers are widely available, make
      CONFIG_RETPOLINE hard depend on the compiler capability.
      
      Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
      support it. Emit an error message in that case:
      
       "arch/x86/Makefile:226: *** You are building kernel with non-retpoline
        compiler, please update your compiler..  Stop."
      
      [dwmw: Fail the build with non-retpoline compiler]
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Michal Marek <michal.lkml@markovi.net>
      Cc: <srinivas.eeda@oracle.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/cca0cb20-f9e2-4094-840b-fb0f8810cd34@default
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c4ad5d3
    • x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant · cbc93677
      Committed by Zhenzhong Duan
      commit 0cbb76d6285794f30953bfa3ab831714b59dd700 upstream
      
      ..so that they match their asm counterpart.
      
      Add the missing ANNOTATE_NOSPEC_ALTERNATIVE in CALL_NOSPEC, while at it.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wang YanQing <udknight@gmail.com>
      Cc: dhaval.giani@oracle.com
      Cc: srinivas.eeda@oracle.com
      Link: http://lkml.kernel.org/r/c3975665-173e-4d70-8dee-06c926ac26ee@default
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbc93677
  7. 27 November 2018, 1 commit
  8. 21 November 2018, 2 commits
    • acpi/nfit, x86/mce: Validate a MCE's address before using it · 8c547624
      Committed by Vishal Verma
      commit e8a308e5 upstream.
      
      The NFIT machine check handler uses the physical address from the mce
      structure, and compares it against information in the ACPI NFIT table
      to determine whether that location lies on an NVDIMM. The mce->addr
      field however may not always be valid, and this is indicated by the
      MCI_STATUS_ADDRV bit in the status field.
      
      Export mce_usable_address() which already performs validation for the
      address, and use it in the NFIT handler.
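      
      A hedged sketch of where the check lands in the nfit MCE notifier; the
      handler name and surrounding logic are abbreviated from recollection,
      not quoted from this text.
      
      /*
       * Hedged sketch: bail out early when the reported address is not
       * flagged as valid/usable before consulting the NFIT SPA ranges.
       */
      static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
                                 void *data)
      {
              struct mce *mce = (struct mce *)data;
      
              if (!mce_is_memory_error(mce) || !mce_usable_address(mce))
                      return NOTIFY_DONE;
      
              /* ... look up mce->addr in the NFIT SPA ranges as before ... */
              return NOTIFY_DONE;
      }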
      
      Fixes: 6839a6d9 ("nfit: do an ARS scrub on hitting a latent media error")
      Reported-by: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Arnd Bergmann <arnd@arndb.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      CC: Dave Jiang <dave.jiang@intel.com>
      CC: elliott@hpe.com
      CC: "H. Peter Anvin" <hpa@zytor.com>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: Len Brown <lenb@kernel.org>
      CC: linux-acpi@vger.kernel.org
      CC: linux-edac <linux-edac@vger.kernel.org>
      CC: linux-nvdimm@lists.01.org
      CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      CC: Ross Zwisler <zwisler@kernel.org>
      CC: stable <stable@vger.kernel.org>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tony Luck <tony.luck@intel.com>
      CC: x86-ml <x86@kernel.org>
      CC: Yazen Ghannam <yazen.ghannam@amd.com>
      Link: http://lkml.kernel.org/r/20181026003729.8420-2-vishal.l.verma@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8c547624
    • acpi/nfit, x86/mce: Handle only uncorrectable machine checks · 9013ac4d
      Committed by Vishal Verma
      commit 5d96c9342c23ee1d084802dcf064caa67ecaa45b upstream.
      
      The MCE handler for nfit devices is called for memory errors on a
      Non-Volatile DIMM and adds the error location to a 'badblocks' list.
      This list is used by the various NVDIMM drivers to avoid consuming known
      poison locations during IO.
      
      The MCE handler gets called for both corrected and uncorrectable errors.
      Until now, both kinds of errors have been added to the badblocks list.
      However, corrected memory errors indicate that the problem has already
      been fixed by hardware, and the resulting interrupt is merely a
      notification to Linux.
      
      As far as future accesses to that location are concerned, it is
      perfectly fine to use, and thus doesn't need to be included in the above
      badblocks list.
      
      Add a check in the nfit MCE handler to filter out corrected mce events,
      and only process uncorrectable errors.
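      
      A hedged sketch of that filter; the mce_is_correctable() helper name
      follows my recollection of the companion x86/mce change, not this text.
      
      /*
       * Hedged sketch: only uncorrectable memory errors should turn into
       * badblocks entries; corrected events are informational.
       */
      static bool nfit_mce_is_actionable(struct mce *mce)
      {
              return mce_is_memory_error(mce) && !mce_is_correctable(mce);
      }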
      
      Fixes: 6839a6d9 ("nfit: do an ARS scrub on hitting a latent media error")
      Reported-by: Omar Avelar <omar.avelar@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Arnd Bergmann <arnd@arndb.de>
      CC: Dan Williams <dan.j.williams@intel.com>
      CC: Dave Jiang <dave.jiang@intel.com>
      CC: elliott@hpe.com
      CC: "H. Peter Anvin" <hpa@zytor.com>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: Len Brown <lenb@kernel.org>
      CC: linux-acpi@vger.kernel.org
      CC: linux-edac <linux-edac@vger.kernel.org>
      CC: linux-nvdimm@lists.01.org
      CC: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
      CC: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      CC: Ross Zwisler <zwisler@kernel.org>
      CC: stable <stable@vger.kernel.org>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Tony Luck <tony.luck@intel.com>
      CC: x86-ml <x86@kernel.org>
      CC: Yazen Ghannam <yazen.ghannam@amd.com>
      Link: http://lkml.kernel.org/r/20181026003729.8420-1-vishal.l.verma@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9013ac4d
  9. 14 November 2018, 2 commits
    • KVM: nVMX: Clear reserved bits of #DB exit qualification · 3223e557
      Committed by Jim Mattson
      [ Upstream commit cfb634fe3052aefc4e1360fa322018c9a0b49755 ]
      
      According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
      qualification field for debug exceptions are reserved (cleared to
      0). However, the SDM is incorrect about bit 16 (corresponding to
      DR6.RTM). This bit should be set if a debug exception (#DB) or a
      breakpoint exception (#BP) occurred inside an RTM region while
      advanced debugging of RTM transactional regions was enabled. Note that
      this is the opposite of DR6.RTM, which "indicates (when clear) that a
      debug exception (#DB) or breakpoint exception (#BP) occurred inside an
      RTM region while advanced debugging of RTM transactional regions was
      enabled."
      
      There is still an issue with stale DR6 bits potentially being
      misreported for the current debug exception.  DR6 should not have been
      modified before vectoring the #DB exception, and the "new DR6 bits"
      should be available somewhere, but it was and they aren't.
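      
      Purely to illustrate the bit layout described above, a hypothetical
      helper (not the patch's code):
      
      /*
       * Keep B0-B3 (bits 3:0), BD (13) and BS (14) from DR6, clear the
       * reserved bits, and set bit 16 to the inverse of DR6.RTM.
       */
      static u64 db_exit_qualification_from_dr6(u64 dr6)
      {
              return (dr6 ^ (1ULL << 16)) & 0x1600fULL;
      }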
      
      Fixes: b96fb439 ("KVM: nVMX: fixes to nested virt interrupt injection")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3223e557
    • x86/mm/pat: Disable preemption around __flush_tlb_all() · e36453b6
      Committed by Sebastian Andrzej Siewior
      commit f77084d96355f5fba8e2c1fb3a51a393b1570de7 upstream.
      
      The WARN_ON_ONCE(__read_cr3() != build_cr3()) in switch_mm_irqs_off()
      triggers every once in a while during a snapshotted system upgrade.
      
      The warning triggers since commit decab088 ("x86/mm: Remove
      preempt_disable/enable() from __native_flush_tlb()"). The callchain is:
      
        get_page_from_freelist() -> post_alloc_hook() -> __kernel_map_pages()
      
      with CONFIG_DEBUG_PAGEALLOC enabled.
      
      Disable preemption during CR3 reset / __flush_tlb_all() and add a comment
      why preemption has to be disabled so it won't be removed accidentally.
      
      Add another preemptible() check in __flush_tlb_all() to catch callers with
      enabled preemption when PGE is enabled, because PGE enabled does not
      trigger the warning in __native_flush_tlb(). Suggested by Andy Lutomirski.
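      
      A hedged sketch of the shape of the fix; the wrapper name is
      illustrative, the real change touches __kernel_map_pages() and
      __flush_tlb_all() directly.
      
      /*
       * Hedged sketch: preemption must not happen between reading and
       * writing CR3 in __native_flush_tlb(), so wrap the debug-pagealloc
       * flush in a preemption-disabled section.
       */
      static void debug_pagealloc_flush(void)     /* illustrative name */
      {
              preempt_disable();
              __flush_tlb_all();                  /* may reload CR3 */
              preempt_enable();
      }
      
      /* and inside __flush_tlb_all(), as described above (sketch):
       *      VM_WARN_ON_ONCE(preemptible());
       */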
      
      Fixes: decab088 ("x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()")
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181017103432.zgv46nlu3hc7k4rq@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e36453b6
  10. 17 October 2018, 1 commit
  11. 14 October 2018, 1 commit
  12. 10 October 2018, 1 commit
    • mm: Preserve _PAGE_DEVMAP across mprotect() calls · 4628a645
      Committed by Jan Kara
      Currently the _PAGE_DEVMAP bit is not preserved in mprotect(2) calls. As a
      result we will see warnings such as:
      
      BUG: Bad page map in process JobWrk0013  pte:800001803875ea25 pmd:7624381067
      addr:00007f0930720000 vm_flags:280000f9 anon_vma:          (null) mapping:ffff97f2384056f0 index:0
      file:457-000000fe00000030-00000009-000000ca-00000001_2001.fileblock fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage:          (null)
      CPU: 3 PID: 15848 Comm: JobWrk0013 Tainted: G        W          4.12.14-2.g7573215-default #1 SLE12-SP4 (unreleased)
      Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.01.00.0833.051120182255 05/11/2018
      Call Trace:
       dump_stack+0x5a/0x75
       print_bad_pte+0x217/0x2c0
       ? enqueue_task_fair+0x76/0x9f0
       _vm_normal_page+0xe5/0x100
       zap_pte_range+0x148/0x740
       unmap_page_range+0x39a/0x4b0
       unmap_vmas+0x42/0x90
       unmap_region+0x99/0xf0
       ? vma_gap_callbacks_rotate+0x1a/0x20
       do_munmap+0x255/0x3a0
       vm_munmap+0x54/0x80
       SyS_munmap+0x1d/0x30
       do_syscall_64+0x74/0x150
       entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      ...
      
      when mprotect(2) gets used on DAX mappings. Also there is a wide variety
      of other failures that can result from the missing _PAGE_DEVMAP flag
      when the area gets used by get_user_pages() later.
      
      Fix the problem by including _PAGE_DEVMAP in a set of flags that get
      preserved by mprotect(2).
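      
      On x86 the fix boils down to adding _PAGE_DEVMAP to the mask of bits
      pte_modify() preserves; a hedged sketch of the resulting definition (the
      neighbouring flags are recalled, not quoted):
      
      /*
       * Hedged sketch: bits listed in _PAGE_CHG_MASK survive pte_modify(),
       * so mprotect() no longer strips _PAGE_DEVMAP from DAX PTEs.
       */
      #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |         \
                               _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
                               _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)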
      
      Fixes: 69660fd7 ("x86, mm: introduce _PAGE_DEVMAP")
      Fixes: ebd31197 ("powerpc/mm: Add devmap support for ppc64")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      4628a645
  13. 03 October 2018, 1 commit
  14. 02 October 2018, 1 commit
  15. 21 September 2018, 1 commit
  16. 20 September 2018, 5 commits
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Committed by Drew Schmitt
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling access to reads of this MSR gives userspace the control to "expose"
      this platform-dependent information to guests in a clear way. As it exists
      today, guests that read this MSR would get unpopulated information if userspace
      hadn't already set it (and prior to this patch series, only the CPUID faulting
      information could have been populated). This existing interface could be
      confusing if guests don't handle the potential for incorrect/incomplete
      information gracefully (e.g. zero reported for base frequency).
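      
      A hedged user-space sketch of toggling the capability on a VM fd; the
      args[0] semantics here follow my reading of the description above, and
      the KVM API documentation is authoritative.
      
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>
      
      /* Hedged sketch: disable guest reads of MSR_PLATFORM_INFO on a VM fd. */
      static int disable_platform_info_reads(int vm_fd)
      {
              struct kvm_enable_cap cap = {
                      .cap  = KVM_CAP_MSR_PLATFORM_INFO,
                      .args = { 0 },   /* 0 = guest reads raise #GP (assumed) */
              };
      
              if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0) {
                      perror("KVM_ENABLE_CAP(KVM_CAP_MSR_PLATFORM_INFO)");
                      return -1;
              }
              return 0;
      }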
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fbbde9a
    • KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Committed by Liran Alon
      In case L1 does not intercept L2 HLT or enters L2 in HLT activity-state,
      it is possible for a vCPU to be blocked while it is in guest-mode.
      
      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.
      
      In addition, if while the vCPU is blocked (while it is in guest-mode),
      it receives a nested posted-interrupt, then the vCPU should also be
      woken and injected with the posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
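      
      A hedged sketch of the priority test described above; the vAPIC page
      access is elided and the helper name is hypothetical, only the RVI/VPPR
      comparison from SDM 29.2.1 is shown.
      
      /*
       * An interrupt is deliverable to L2 when the high nibble of RVI
       * exceeds the high nibble of the virtual PPR in L1's vAPIC page.
       */
      static bool vapic_has_pending_interrupt(u8 rvi, u32 vppr)
      {
              return (rvi & 0xf0) > (vppr & 0xf0);
      }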
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6c67d8c
    • x86/hyper-v: rename ipi_arg_{ex,non_ex} structures · a1efa9b7
      Committed by Vitaly Kuznetsov
      These structures are going to be used from KVM code so let's make
      their names reflect their Hyper-V origin.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Acked-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a1efa9b7
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Committed by Sean Christopherson
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d264ee0c
    • x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Committed by Vitaly Kuznetsov
      When VMX is used with flexpriority disabled (because of no support or
      if disabled with module parameter) MMIO interface to lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when APIC access page is not used. apic_mmio_{read,write}
      only check if lAPIC is disabled before proceeding to actual write.
      
      When APIC access is virtualized we correctly manipulate with VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling MMIO interface seems to be easy. The question is: what do we
      do with these reads and writes? If we add apic_x2apic_mode() check to
      apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will
      go to userspace. When lAPIC is in kernel, Qemu uses this interface to
      inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This
      somehow works with disabled lAPIC but when we're in xAPIC mode we will
      get a real injected MSI from every write to lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1766202
  17. 16 September 2018, 1 commit
    • x86/mm: Add .bss..decrypted section to hold shared variables · b3f0907c
      Committed by Brijesh Singh
      kvmclock defines a few static variables which are shared with the
      hypervisor during the kvmclock initialization.
      
      When SEV is active, memory is encrypted with a guest-specific key, and
      if the guest OS wants to share the memory region with the hypervisor
      then it must clear the C-bit before sharing it.
      
      Currently, we use kernel_physical_mapping_init() to split large pages
      before clearing the C-bit on shared pages. But it fails when called from
      the kvmclock initialization (mainly because the memblock allocator is
      not ready that early during boot).
      
      Add a __bss_decrypted section attribute which can be used when defining
      such shared variable. The so-defined variables will be placed in the
      .bss..decrypted section. This section will be mapped with C=0 early
      during boot.
      
      The .bss..decrypted section has a big chunk of memory that is unused
      when memory encryption is not active, so free it in that case.
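      
      A hedged usage sketch for the kvmclock case that motivates this change;
      the array name and size macro are illustrative.
      
      /*
       * Variables that must be shared with the hypervisor under SEV are
       * tagged so they land in .bss..decrypted and get mapped with C=0.
       */
      static struct pvclock_vsyscall_time_info
              hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __bss_decrypted __aligned(PAGE_SIZE);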
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
      b3f0907c
  18. 14 September 2018, 1 commit
    • Revert "x86/mm/legacy: Populate the user page-table with user pgd's" · 61a6bd83
      Committed by Joerg Roedel
      This reverts commit 1f40a46c.
      
      It turned out that this patch is not sufficient to enable PTI on 32 bit
      systems with legacy 2-level page-tables. In this paging mode the huge-page
      PTEs are in the top-level page-table directory, where also the mirroring to
      the user-space page-table happens. So every huge PTE exists twice, in the
      kernel and in the user page-table.
      
      That means that accessed/dirty bits need to be fetched from two PTEs in
      this mode to be safe, but this is not trivial to implement because it needs
      changes to generic code just for the sake of enabling PTI with 32-bit
      legacy paging. As all systems that need PTI should support PAE anyway,
      remove support for PTI when 32-bit legacy paging is used.
      
      Fixes: 7757d607 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
      Reported-by: Meelis Roos <mroos@linux.ee>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
      61a6bd83
  19. 08 September 2018, 2 commits
    • x86/mm: Use WRITE_ONCE() when setting PTEs · 9bc4f28a
      Committed by Nadav Amit
      When page-table entries are set, the compiler might optimize their
      assignment by using multiple instructions to set the PTE. This might
      turn into a security hazard if the user somehow manages to use the
      interim PTE. L1TF does not make our lives easier, making even an interim
      non-present PTE a security hazard.
      
      Using WRITE_ONCE() to set PTEs and friends should prevent this potential
      security hazard.
      
      I skimmed the differences in the binary with and without this patch. The
      differences are (obviously) greater when CONFIG_PARAVIRT=n as more
      code optimizations are possible. For better and worse, the impact on the
      binary with this patch is pretty small. Skimming the code did not cause
      anything to jump out as a security hazard, but it seems that at least
      move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
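      
      A hedged sketch of the change's shape for one of the helpers
      (native_set_pte()):
      
      /*
       * Emit the PTE store as a single access so the compiler cannot tear it
       * into multiple writes that expose an interim PTE.
       */
      static inline void native_set_pte(pte_t *ptep, pte_t pte)
      {
              WRITE_ONCE(*ptep, pte);
      }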
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
      9bc4f28a
    • KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Committed by Wanpeng Li
      Dan Carpenter reported that the untrusted data returned from kvm_register_read()
      results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      KVM guest can easily trigger this by executing the following assembly sequence
      in Ring0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      As this will cause KVM to execute the following code-path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which will reach out-of-bounds access.
      
      This patch fixes it by adding a check to kvm_pv_send_ipi() against
      map->max_apic_id, ignoring destinations that are not present and delivering
      the rest. We also check whether or not map->phys_map[min + i] is NULL, since
      max_apic_id is set to the maximum APIC ID and some phys_map entries may be
      NULL when APIC IDs are sparse; in particular, kvm unconditionally sets
      max_apic_id to 255 to reserve enough space for any xAPIC ID.
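      
      A hedged sketch of the hardened delivery loop; the helper name is
      illustrative, and the clamping and NULL check follow the description
      above rather than quoting the patch.
      
      /*
       * Deliver to one bitmap word's worth of destinations, clamped to
       * max_apic_id and skipping sparse (NULL) phys_map slots.
       */
      static int pv_ipi_deliver_word(struct kvm_apic_map *map, unsigned long bitmap,
                                     u32 min_id, struct kvm_lapic_irq *irq)
      {
              int i, count = 0;
      
              if (min_id > map->max_apic_id)
                      return 0;
      
              for_each_set_bit(i, &bitmap,
                               min((u32)BITS_PER_LONG,
                                   (map->max_apic_id - min_id + 1))) {
                      if (map->phys_map[min_id + i])
                              count += kvm_apic_set_irq(map->phys_map[min_id + i]->vcpu,
                                                        irq, NULL);
              }
              return count;
      }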
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      bdf7ffc8
  20. 07 September 2018, 1 commit