提交 · b2502b418e63fcde0fe1857732a476b5aa3789b1 · _Walt / cloud-kernel

08 6月, 2015 2 次提交

x86/asm/entry: Untangle 'system_call' into two entry points: entry_SYSCALL_64 and entry_INT80_32 · b2502b41

由 Ingo Molnar 提交于 6月 08, 2015

The 'system_call' entry points differ starkly between native 32-bit and 64-bit
kernels: on 32-bit kernels it defines the INT 0x80 entry point, while on
64-bit it's the SYSCALL entry point.

This is pretty confusing when looking at generic code, and it also obscures
the nature of the entry point at the assembly level.

So unangle this by splitting the name into its two uses:

	system_call (32) -> entry_INT80_32
	system_call (64) -> entry_SYSCALL_64

As per the generic naming scheme for x86 system call entry points:

	entry_MNEMONIC_qualifier

where 'qualifier' is one of _32, _64 or _compat.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

b2502b41

x86/asm/entry: Rename compat syscall entry points · 2cd23553

由 Ingo Molnar 提交于 6月 08, 2015

Rename the following system call entry points:

	ia32_cstar_target       -> entry_SYSCALL_compat
	ia32_syscall            -> entry_INT80_compat

The generic naming scheme for x86 system call entry points is:

	entry_MNEMONIC_qualifier

where 'qualifier' is one of _32, _64 or _compat.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

2cd23553

10 5月, 2015 1 次提交

x86/asm/entry: Remove SYSCALL_VECTOR · 51bb9284

由 Brian Gerst 提交于 5月 09, 2015

Use IA32_SYSCALL_VECTOR for both compat and native.
Signed-off-by: NBrian Gerst <brgerst@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431185813-15413-4-git-send-email-brgerst@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

51bb9284

31 3月, 2015 1 次提交

x86/asm/entry: Remove user_mode_ignore_vm86() · 55474c48

由 Ingo Molnar 提交于 3月 29, 2015

user_mode_ignore_vm86() can be used instead of user_mode(), in
places where we have already done a v8086_mode() security
check of ptregs.

But doing this check in the wrong place would be a bug that
could result in security problems, and also the naming still
isn't very clear.

Furthermore, it only affects 32-bit kernels, while most
development happens on 64-bit kernels.

If we replace them with user_mode() checks then the cost is only
a very minor increase in various slowpaths:

   text             data   bss     dec              hex    filename
   10573391         703562 1753042 13029995         c6d26b vmlinux.o.before
   10573423         703562 1753042 13030027         c6d28b vmlinux.o.after

So lets get rid of this distinction once and for all.
Acked-by: NBorislav Petkov <bp@suse.de>
Acked-by: NAndy Lutomirski <luto@kernel.org>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

55474c48

23 3月, 2015 4 次提交

x86/asm/entry: Replace some open-coded VM86 checks with v8086_mode() checks · d74ef111

由 Andy Lutomirski 提交于 3月 18, 2015

This allows us to remove some unnecessary ifdefs.  There should
be no change to the generated code.
Signed-off-by: NAndy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f7e00f0d668e253abf0bd8bf36491ac47bd761ff.1426728647.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

d74ef111

x86/asm/entry: Change all 'user_mode_vm()' calls to 'user_mode()' · f39b6f0e

由 Andy Lutomirski 提交于 3月 18, 2015

user_mode_vm() and user_mode() are now the same.  Change all callers
of user_mode_vm() to user_mode().

The next patch will remove the definition of user_mode_vm.
Signed-off-by: NAndy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/43b1f57f3df70df5a08b0925897c660725015554.1426728647.git.luto@kernel.org
[ Merged to a more recent kernel. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

f39b6f0e

x86/asm/entry: Use user_mode_ignore_vm86() where appropriate · ae60f071

由 Andy Lutomirski 提交于 3月 18, 2015

A few of the user_mode() checks in traps.c are immediately after
explicit checks for vm86 mode.  Change them to user_mode_ignore_vm86().
Signed-off-by: NAndy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0b324d5b75c3402be07f8d3c6245ed7f4995029e.1426728647.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

ae60f071

x86/fpu: Rename drop_init_fpu() to fpu_reset_state() · b85e67d1

由 Borislav Petkov 提交于 3月 16, 2015

Call it what it does and in accordance with the context where it is
used: we reset the FPU state either because we were unable to restore it
from the one saved in the task or because we simply want to reset it.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

b85e67d1

10 3月, 2015 1 次提交

x86/asm/entry/32: Fix user_mode() misuses · 394838c9

由 Andy Lutomirski 提交于 3月 09, 2015

The one in do_debug() is probably harmless, but better safe than sorry.
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d67deaa9df5458363623001f252d1aee3215d014.1425948056.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

394838c9

09 3月, 2015 1 次提交

context_tracking: Rename context symbols to prepare for transition state · c467ea76

由 Frederic Weisbecker 提交于 3月 04, 2015

Current context tracking symbols are designed to express living state.
As such they are prefixed with "IN_": IN_USER, IN_KERNEL.

Now we are going to use these symbols to also express state transitions
such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER).
But while the "IN_" prefix works well to express entering a context, it's
confusing to depict a context exit: context_tracking_exit(IN_USER)
could mean two things:
	1) We are exiting the current context to enter user context.
	2) We are exiting the user context
We want 2) but the reviewer may be confused and understand 1)

So lets disambiguate these symbols and rename them to CONTEXT_USER and
CONTEXT_KERNEL.
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>

c467ea76

07 3月, 2015 1 次提交

x86/asm/entry: Replace this_cpu_sp0() with current_top_of_stack() and fix it on x86_32 · a7fcf28d

由 Andy Lutomirski 提交于 3月 06, 2015

I broke 32-bit kernels.  The implementation of sp0 was correct
as far as I can tell, but sp0 was much weirder on x86_32 than I
realized.  It has the following issues:

 - Init's sp0 is inconsistent with everything else's: non-init tasks
   are offset by 8 bytes.  (I have no idea why, and the comment is unhelpful.)

 - vm86 does crazy things to sp0.

Fix it up by replacing this_cpu_sp0() with
current_top_of_stack() and using a new percpu variable to track
the top of the stack on x86_32.
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 75182b16 ("x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0()")
Link: http://lkml.kernel.org/r/d09dbe270883433776e0cbee3c7079433349e96d.1425692936.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

a7fcf28d

06 3月, 2015 1 次提交

x86/asm/entry: Switch all C consumers of kernel_stack to this_cpu_sp0() · 75182b16

由 Andy Lutomirski 提交于 3月 05, 2015

This will make modifying the semantics of kernel_stack easier.

The change to ist_begin_non_atomic() is necessary because sp0 no
longer points to the same THREAD_SIZE-aligned region as RSP;
it's one byte too high for that.  At Denys' suggestion, rather
than offsetting it, just check explicitly that we're in the
correct range ending at sp0.  This has the added benefit that we
no longer assume that the thread stack is aligned to
THREAD_SIZE.
Suggested-by: NDenys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ef8254ad414cbb8034c9a56396eeb24f5dd5b0de.1425611534.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

75182b16

05 3月, 2015 1 次提交

x86/traps: Separate set_intr_gate() and clean up early_trap_init() · 5eca7453

由 Wang Nan 提交于 2月 27, 2015

As early_trap_init() doesn't use IST, replace
set_intr_gate_ist() and set_system_intr_gate_ist() with their
standard counterparts.

set_intr_gate() requires a trace_debug symbol which we don't
have and won't use. This patch separates set_intr_gate() into two
parts, and uses base version in early_trap_init().
Reported-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NWang Nan <wangnan0@huawei.com>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: <dave.hansen@linux.intel.com>
Cc: <lizefan@huawei.com>
Cc: <masami.hiramatsu.pt@hitachi.com>
Cc: <oleg@redhat.com>
Cc: <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1425010789-13714-1-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

5eca7453

26 2月, 2015 1 次提交

x86/traps: Enable DEBUG_STACK after cpu_init() for TRAP_DB/BP · b4d83270

由 Wang Nan 提交于 2月 26, 2015

Before this patch early_trap_init() installs DEBUG_STACK for
X86_TRAP_BP and X86_TRAP_DB. However, DEBUG_STACK doesn't work
correctly until cpu_init() <-- trap_init().

This patch passes 0 to set_intr_gate_ist() and
set_system_intr_gate_ist() instead of DEBUG_STACK to let it use
same stack as kernel, and installs DEBUG_STACK for them in
trap_init().

As core runs at ring 0 between early_trap_init() and
trap_init(), there is no chance to get a bad stack before
trap_init().

As NMI is also enabled in trap_init(), we don't need to care
about is_debug_stack() and related things used in
arch/x86/kernel/nmi.c.
Signed-off-by: NWang Nan <wangnan0@huawei.com>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: <dave.hansen@linux.intel.com>
Cc: <lizefan@huawei.com>
Cc: <luto@amacapital.net>
Cc: <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1424929779-13174-1-git-send-email-wangnan0@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b4d83270

19 2月, 2015 1 次提交

x86/fpu: Change math_error() to use unlazy_fpu(), kill (now) unused save_init_fpu() · 08a744c6

由 Oleg Nesterov 提交于 2月 06, 2015

math_error() calls save_init_fpu() after conditional_sti(), this means
that the caller can be preempted. If !use_eager_fpu() we can hit the
WARN_ON_ONCE(!__thread_has_fpu(tsk)) and/or save the wrong FPU state.

Change math_error() to use unlazy_fpu() and kill save_init_fpu().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NRik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1423252925-14451-4-git-send-email-riel@redhat.comSigned-off-by: NBorislav Petkov <bp@suse.de>

08a744c6

01 2月, 2015 1 次提交

x86, traps: Fix ist_enter from userspace · b926e6f6

由 Andy Lutomirski 提交于 1月 31, 2015

context_tracking_user_exit() has no effect if in_interrupt() returns true,
so ist_enter() didn't work. Fix it by calling exception_enter(), and thus
context_tracking_user_exit(), before incrementing the preempt count.

This also adds an assertion that will catch the problem reliably if
CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.

Link: http://lkml.kernel.org/r/261ebee6aee55a4724746d0d7024697013c40a08.1422709102.git.luto@amacapital.net
Fixes: 95927475 x86, traps: Track entry into and exit from IST context
Reported-and-tested-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>

b926e6f6

20 1月, 2015 1 次提交

x86, fpu: Fix math_state_restore() race with kernel_fpu_begin() · 7575637a

由 Oleg Nesterov 提交于 1月 15, 2015

math_state_restore() can race with kernel_fpu_begin() if irq comes
right after __thread_fpu_begin(), __save_init_fpu() will overwrite
fpu->state we are going to restore.

Add 2 simple helpers, kernel_fpu_disable() and kernel_fpu_enable()
which simply set/clear in_kernel_fpu, and change math_state_restore()
to exclude kernel_fpu_begin() in between.

Alternatively we could use local_irq_save/restore, but probably these
new helpers can have more users.

Perhaps they should disable/enable preemption themselves, in this case
we can remove preempt_disable() in __restore_xstate_sig().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Cc: matt.fleming@intel.com
Cc: bp@suse.de
Cc: pbonzini@redhat.com
Cc: luto@amacapital.net
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150115192028.GD27332@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

7575637a

03 1月, 2015 3 次提交

x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic · bced35b6

由 Andy Lutomirski 提交于 11月 19, 2014

In some IST handlers, if the interrupt came from user mode,
we can safely enable preemption.  Add helpers to do it safely.

This is intended to be used my the memory failure code in
do_machine_check.
Acked-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>

bced35b6

x86, traps: Track entry into and exit from IST context · 95927475

由 Andy Lutomirski 提交于 11月 19, 2014

We currently pretend that IST context is like standard exception
context, but this is incorrect.  IST entries from userspace are like
standard exceptions except that they use per-cpu stacks, so they are
atomic.  IST entries from kernel space are like NMIs from RCU's
perspective -- they are not quiescent states even if they
interrupted the kernel during a quiescent state.

Add and use ist_enter and ist_exit to track IST context.  Even
though x86_32 has no IST stacks, we track these interrupts the same
way.

This fixes two issues:

 - Scheduling from an IST interrupt handler will now warn.  It would
   previously appear to work as long as we got lucky and nothing
   overwrote the stack frame.  (I don't know of any bugs in this
   that would trigger the warning, but it's good to be on the safe
   side.)

 - RCU handling in IST context was dangerous.  As far as I know,
   only machine checks were likely to trigger this, but it's good to
   be on the safe side.

Note that the machine check handlers appears to have been missing
any context tracking at all before this patch.

Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>

95927475

x86, entry: Switch stacks on a paranoid entry from userspace · 48e08d0f

由 Andy Lutomirski 提交于 11月 11, 2014

This causes all non-NMI, non-double-fault kernel entries from
userspace to run on the normal kernel stack.  Double-fault is
exempt to minimize confusion if we double-fault directly from
userspace due to a bad kernel stack.

This is, suprisingly, simpler and shorter than the current code.  It
removes the IMO rather frightening paranoid_userspace path, and it
make sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel
stack that we switch to is empty.

This will also enable us to create non-atomic sections within
machine checks from userspace, which will simplify memory failure
handling.  It will also allow the upcoming fsgsbase code to be
simplified, because it doesn't need to worry about usergs when
scheduling in paranoid_exit, as that code no longer exists.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: NBorislav Petkov <bp@alien8.de>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>

48e08d0f

08 12月, 2014 1 次提交

x86_64/traps: Fix always true condition · e10abb2f

由 Dan Carpenter 提交于 11月 25, 2014

We should be checking IS_ERR() here.  PTR_ERR() is always true.

Fixes: fe3d197f ('x86, mpx: On-demand kernel allocation of
bounds tables')
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20141125172114.GA24535@mwandaSigned-off-by: NIngo Molnar <mingo@kernel.org>

e10abb2f

25 11月, 2014 1 次提交

x86/asm/traps: Disable tracing and kprobes in fixup_bad_iret and sync_regs · 7ddc6a21

由 Andy Lutomirski 提交于 11月 24, 2014

These functions can be executed on the int3 stack, so kprobes
are dangerous. Tracing is probably a bad idea, too.

Fixes: b645af2d ("x86_64, traps: Rework bad_iret")
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Cc: <stable@vger.kernel.org> # Backport as far back as it would apply
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/50e33d26adca60816f3ba968875801652507d0c4.1416870125.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

7ddc6a21

24 11月, 2014 3 次提交

x86_64, traps: Rework bad_iret · b645af2d

由 Andy Lutomirski 提交于 11月 22, 2014

It's possible for iretq to userspace to fail.  This can happen because
of a bad CS, SS, or RIP.

Historically, we've handled it by fixing up an exception from iretq to
land at bad_iret, which pretends that the failed iret frame was really
the hardware part of #GP(0) from userspace.  To make this work, there's
an extra fixup to fudge the gs base into a usable state.

This is suboptimal because it loses the original exception.  It's also
buggy because there's no guarantee that we were on the kernel stack to
begin with.  For example, if the failing iret happened on return from an
NMI, then we'll end up executing general_protection on the NMI stack.
This is bad for several reasons, the most immediate of which is that
general_protection, as a non-paranoid idtentry, will try to deliver
signals and/or schedule from the wrong stack.

This patch throws out bad_iret entirely.  As a replacement, it augments
the existing swapgs fudge into a full-blown iret fixup, mostly written
in C.  It's should be clearer and more correct.
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b645af2d

x86_64, traps: Stop using IST for #SS · 6f442be2

由 Andy Lutomirski 提交于 11月 22, 2014

On a 32-bit kernel, this has no effect, since there are no IST stacks.

On a 64-bit kernel, #SS can only happen in user code, on a failed iret
to user space, a canonical violation on access via RSP or RBP, or a
genuine stack segment violation in 32-bit kernel code.  The first two
cases don't need IST, and the latter two cases are unlikely fatal bugs,
and promoting them to double faults would be fine.

This fixes a bug in which the espfix64 code mishandles a stack segment
violation.

This saves 4k of memory per CPU and a tiny bit of code.
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6f442be2

x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C · af726f21

由 Andy Lutomirski 提交于 11月 22, 2014

There's nothing special enough about the espfix64 double fault fixup to
justify writing it in assembly.  Move it to C.

This also fixes a bug: if the double fault came from an IST stack, the
old asm code would return to a partially uninitialized stack frame.

Fixes: 3891a04aSigned-off-by: NAndy Lutomirski <luto@amacapital.net>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

af726f21

18 11月, 2014 1 次提交

x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f

由 Dave Hansen 提交于 11月 14, 2014

This is really the meat of the MPX patch set.  If there is one patch to
review in the entire series, this is the one.  There is a new ABI here
and this kernel code also interacts with userspace memory in a
relatively unusual manner.  (small FAQ below).

Long Description:

This patch adds two prctl() commands to provide enable or disable the
management of bounds tables in kernel, including on-demand kernel
allocation (See the patch "on-demand kernel allocation of bounds tables")
and cleanup (See the patch "cleanup unused bound tables"). Applications
do not strictly need the kernel to manage bounds tables and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel can not simply infer whether an application
needs bounds table management from the MPX registers.  The prctl() is an
explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
require kernel's help in managing bounds tables.

PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
won't allocate and free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
directory out of a userspace register (bndcfgu) and then cache it into
a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
will set "bd_addr" to an invalid address.  Using this scheme, we can
use "bd_addr" to determine whether the management of bounds tables in
kernel is enabled.

Also, the only way to access that bndcfgu register is via an xsaves,
which can be expensive.  Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we can not apply this optimization to #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX only has 4 hardware registers for storing bounds information.
If MPX-enabled code needs more than these 4 registers, it needs to
spill them somewhere. It has two special instructions for this
which allow the bounds to be moved between the bounds registers
and some new "bounds tables".

They are similar conceptually to a page fault and will be raised by
the MPX hardware during both bounds violations or when the tables
are not present. This patch handles those #BR exceptions for
not-present tables by carving the space out of the normal processes
address space (essentially calling the new mmap() interface indroduced
earlier in this patch set.) and then pointing the bounds-directory
over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel.
However, MPX does not strictly *require* anything in the kernel.
It can theoretically be done completely from userspace. Here are
a few ways this *could* be done. I don't think any of them are
practical in the real-world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so
   that we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds
   directory. If we were to preallocate them for the 128TB of
   user virtual address space, we would need to reserve 512TB+2GB,
   which is larger than the entire virtual address space today.
   This means they can not be reserved ahead of time. Also, a
   single process's pre-popualated bounds directory consumes 2GB
   of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory
   is allocated which might contain pointers that might eventually
   need bounds tables?
A: This would work if we could hook the site of each and every
   memory allocation syscall. This can be done for small,
   constrained applications. But, it isn't practical at a larger
   scale since a given app has no way of controlling how all the
   parts of the app might allocate memory (think libraries). The
   kernel is really the only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async
   handler functions and even if mmap() would work it still
   requires locking or nasty tricks to keep track of the
   allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.
Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

fe3d197f

14 6月, 2014 1 次提交

x86/kprobes: Fix build errors and blacklist context_track_user · 4cdf77a8

由 Masami Hiramatsu 提交于 6月 14, 2014

This essentially reverts commit:

  ecd50f71 ("kprobes, x86: Call exception_enter after kprobes handled")

since it causes build errors with CONFIG_CONTEXT_TRACKING and
that has been made from misunderstandings;
context_track_user_*() don't involve much in interrupt context,
it just returns if in_interrupt() is true.

Instead of changing the do_debug/int3(), this just adds
context_track_user_*() to kprobes blacklist, since those are
still can be called right before kprobes handles int3 and debug
exceptions, and probing those will cause an infinite loop.
Reported-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Link: http://lkml.kernel.org/r/20140614064711.7865.45957.stgit@kbuild-fedora.novalocalSigned-off-by: NIngo Molnar <mingo@kernel.org>

4cdf77a8

14 5月, 2014 7 次提交

uprobes/x86: Fix the wrong ->si_addr when xol triggers a trap · b02ef20a

由 Oleg Nesterov 提交于 5月 12, 2014

If the probed insn triggers a trap, ->si_addr = regs->ip is technically
correct, but this is not what the signal handler wants; we need to pass
the address of the probed insn, not the address of xol slot.

Add the new arch-agnostic helper, uprobe_get_trap_addr(), and change
fill_trap_info() and math_error() to use it. !CONFIG_UPROBES case in
uprobes.h uses a macro to avoid include hell and ensure that it can be
compiled even if an architecture doesn't define instruction_pointer().

Test-case:

	#include <signal.h>
	#include <stdio.h>
	#include <unistd.h>

	extern void probe_div(void);

	void sigh(int sig, siginfo_t *info, void *c)
	{
		int passed = (info->si_addr == probe_div);
		printf(passed ? "PASS\n" : "FAIL\n");
		_exit(!passed);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= sigh,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGFPE, &sa, NULL);

		asm (
			"xor %ecx,%ecx\n"
			".globl probe_div; probe_div:\n"
			"idiv %ecx\n"
		);

		return 0;
	}

it fails if probe_div() is probed.

Note: show_unhandled_signals users should probably use this helper too,
but we need to cleanup them first.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

b02ef20a

x86/traps: Kill DO_ERROR_INFO() · 0eb14833

由 Oleg Nesterov 提交于 5月 08, 2014

Now that DO_ERROR_INFO() doesn't differ from DO_ERROR() we can remove
it and use DO_ERROR() instead.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>

0eb14833

x86/traps: Shift fill_trap_info() from DO_ERROR_INFO() to do_error_trap() · 1c326c4d

由 Oleg Nesterov 提交于 5月 08, 2014

Move the callsite of fill_trap_info() into do_error_trap() and remove
the "siginfo_t *info" argument.

This obviously breaks DO_ERROR() which passed info == NULL, we simply
change fill_trap_info() to return "siginfo_t *" and add the "default"
case which returns SEND_SIG_PRIV.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>

1c326c4d

x86/traps: Introduce fill_trap_info(), simplify DO_ERROR_INFO() · 958d3d72

由 Oleg Nesterov 提交于 5月 07, 2014

Extract the fill-siginfo code from DO_ERROR_INFO() into the new helper,
fill_trap_info().

It can calculate si_code and si_addr looking at trapnr, so we can remove
these arguments from DO_ERROR_INFO() and simplify the source code. The
generated code is the same, __builtin_constant_p(trapnr) == T.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>

958d3d72

x86/traps: Introduce do_error_trap() · dff0796e

由 Oleg Nesterov 提交于 5月 07, 2014

Move the common code from DO_ERROR() and DO_ERROR_INFO() into the new
helper, do_error_trap(). This simplifies define's and shaves 527 bytes
from traps.o.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>

dff0796e

x86/traps: Use SEND_SIG_PRIV instead of force_sig() · 38cad57b

由 Oleg Nesterov 提交于 5月 07, 2014

force_sig() is just force_sig_info(SEND_SIG_PRIV). Imho it should die,
we have too many ugly "send signal" helpers.

And do_trap() looks just ugly because it uses force_sig_info() or
force_sig() depending on info != NULL.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>

38cad57b

O
x86/traps: Make math_error() static · 5e1b05be
由 Oleg Nesterov 提交于 5月 08, 2014
```
Trivial, make math_error() static.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
```
5e1b05be

06 5月, 2014 1 次提交

asmlinkage, x86: Add explicit __visible to arch/x86/* · 2605fc21

由 Andi Kleen 提交于 5月 02, 2014

As requested by Linus add explicit __visible to the asmlinkage users.
This marks all functions visible to assembler.

Tree sweep for arch/x86/*
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>

2605fc21

24 4月, 2014 3 次提交

kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation · 9326638c

由 Masami Hiramatsu 提交于 4月 17, 2014

Use NOKPROBE_SYMBOL macro for protecting functions
from kprobes instead of __kprobes annotation under
arch/x86.

This applies nokprobe_inline annotation for some cases,
because NOKPROBE_SYMBOL() will inhibit inlining by
referring the symbol address.

This just folds a bunch of previous NOKPROBE_SYMBOL()
cleanup patches for x86 to one patch.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

9326638c

kprobes, x86: Call exception_enter after kprobes handled · ecd50f71

由 Masami Hiramatsu 提交于 4月 17, 2014

Move exception_enter() call after kprobes handler
is done. Since the exception_enter() involves
many other functions (like printk), it can cause
recursive int3/break loop when kprobes probe such
functions.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081740.26341.10894.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>

ecd50f71

kprobes/x86: Call exception handlers directly from do_int3/do_debug · 6f6343f5

由 Masami Hiramatsu 提交于 4月 17, 2014

To avoid a kernel crash by probing on lockdep code, call
kprobe_int3_handler() and kprobe_debug_handler()(which was
formerly called post_kprobe_handler()) directly from
do_int3 and do_debug.

Currently kprobes uses notify_die() to hook the int3/debug
exceptoins. Since there is a locking code in notify_die,
the lockdep code can be invoked. And because the lockdep
involves printk() related things, theoretically, we need to
prohibit probing on such code, which means much longer blacklist
we'll have. Instead, hooking the int3/debug for kprobes before
notify_die() can avoid this problem.

Anyway, most of the int3 handlers in the kernel are already
called from do_int3 directly, e.g. ftrace_int3_handler,
poke_int3_handler, kgdb_ll_trap. Actually only
kprobe_exceptions_notify is on the notifier_call_chain.
Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Lebon <jlebon@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20140417081733.26341.24423.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>

6f6343f5

12 12月, 2013 1 次提交

x86/traps: Clean up error exception handler definitions · d8af4ce4

由 Ingo Molnar 提交于 12月 12, 2013

So I was reading the exception handler generation code and got a real
headache looking at the unstructured mess that our DO_ERROR*()
generation code is today.

Make it more readable.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-kuabysiykvUJpgus35lhnhvs@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

d8af4ce4

13 11月, 2013 1 次提交

x86: move fpu_counter into ARCH specific thread_struct · c375f15a

由 Vineet Gupta 提交于 11月 12, 2013

Only a couple of arches (sh/x86) use fpu_counter in task_struct so it can
be moved out into ARCH specific thread_struct, reducing the size of
task_struct for other arches.

Compile tested i386_defconfig + gcc 4.7.3
Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
Acked-by: NIngo Molnar <mingo@kernel.org>
Cc: Paul Mundt <paul.mundt@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c375f15a

_Walt / cloud-kernel 与 Fork 源项目一致

_Walt / cloud-kernel
与 Fork 源项目一致