1. 08 6月, 2015 2 次提交
    • I
      x86/asm/entry: Untangle 'system_call' into two entry points: entry_SYSCALL_64 and entry_INT80_32 · b2502b41
      Ingo Molnar 提交于
      The 'system_call' entry points differ starkly between native 32-bit and 64-bit
      kernels: on 32-bit kernels it defines the INT 0x80 entry point, while on
      64-bit it's the SYSCALL entry point.
      
      This is pretty confusing when looking at generic code, and it also obscures
      the nature of the entry point at the assembly level.
      
      So unangle this by splitting the name into its two uses:
      
      	system_call (32) -> entry_INT80_32
      	system_call (64) -> entry_SYSCALL_64
      
      As per the generic naming scheme for x86 system call entry points:
      
      	entry_MNEMONIC_qualifier
      
      where 'qualifier' is one of _32, _64 or _compat.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b2502b41
    • I
      x86/asm/entry: Rename compat syscall entry points · 2cd23553
      Ingo Molnar 提交于
      Rename the following system call entry points:
      
      	ia32_cstar_target       -> entry_SYSCALL_compat
      	ia32_syscall            -> entry_INT80_compat
      
      The generic naming scheme for x86 system call entry points is:
      
      	entry_MNEMONIC_qualifier
      
      where 'qualifier' is one of _32, _64 or _compat.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2cd23553
  2. 10 5月, 2015 1 次提交
  3. 31 3月, 2015 1 次提交
    • I
      x86/asm/entry: Remove user_mode_ignore_vm86() · 55474c48
      Ingo Molnar 提交于
      user_mode_ignore_vm86() can be used instead of user_mode(), in
      places where we have already done a v8086_mode() security
      check of ptregs.
      
      But doing this check in the wrong place would be a bug that
      could result in security problems, and also the naming still
      isn't very clear.
      
      Furthermore, it only affects 32-bit kernels, while most
      development happens on 64-bit kernels.
      
      If we replace them with user_mode() checks then the cost is only
      a very minor increase in various slowpaths:
      
         text             data   bss     dec              hex    filename
         10573391         703562 1753042 13029995         c6d26b vmlinux.o.before
         10573423         703562 1753042 13030027         c6d28b vmlinux.o.after
      
      So lets get rid of this distinction once and for all.
      Acked-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      55474c48
  4. 23 3月, 2015 4 次提交
  5. 10 3月, 2015 1 次提交
  6. 09 3月, 2015 1 次提交
    • F
      context_tracking: Rename context symbols to prepare for transition state · c467ea76
      Frederic Weisbecker 提交于
      Current context tracking symbols are designed to express living state.
      As such they are prefixed with "IN_": IN_USER, IN_KERNEL.
      
      Now we are going to use these symbols to also express state transitions
      such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER).
      But while the "IN_" prefix works well to express entering a context, it's
      confusing to depict a context exit: context_tracking_exit(IN_USER)
      could mean two things:
      	1) We are exiting the current context to enter user context.
      	2) We are exiting the user context
      We want 2) but the reviewer may be confused and understand 1)
      
      So lets disambiguate these symbols and rename them to CONTEXT_USER and
      CONTEXT_KERNEL.
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will deacon <will.deacon@arm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      c467ea76
  7. 07 3月, 2015 1 次提交
  8. 06 3月, 2015 1 次提交
  9. 05 3月, 2015 1 次提交
  10. 26 2月, 2015 1 次提交
  11. 19 2月, 2015 1 次提交
  12. 01 2月, 2015 1 次提交
  13. 20 1月, 2015 1 次提交
  14. 03 1月, 2015 3 次提交
    • A
      x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic · bced35b6
      Andy Lutomirski 提交于
      In some IST handlers, if the interrupt came from user mode,
      we can safely enable preemption.  Add helpers to do it safely.
      
      This is intended to be used my the memory failure code in
      do_machine_check.
      Acked-by: NBorislav Petkov <bp@suse.de>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      bced35b6
    • A
      x86, traps: Track entry into and exit from IST context · 95927475
      Andy Lutomirski 提交于
      We currently pretend that IST context is like standard exception
      context, but this is incorrect.  IST entries from userspace are like
      standard exceptions except that they use per-cpu stacks, so they are
      atomic.  IST entries from kernel space are like NMIs from RCU's
      perspective -- they are not quiescent states even if they
      interrupted the kernel during a quiescent state.
      
      Add and use ist_enter and ist_exit to track IST context.  Even
      though x86_32 has no IST stacks, we track these interrupts the same
      way.
      
      This fixes two issues:
      
       - Scheduling from an IST interrupt handler will now warn.  It would
         previously appear to work as long as we got lucky and nothing
         overwrote the stack frame.  (I don't know of any bugs in this
         that would trigger the warning, but it's good to be on the safe
         side.)
      
       - RCU handling in IST context was dangerous.  As far as I know,
         only machine checks were likely to trigger this, but it's good to
         be on the safe side.
      
      Note that the machine check handlers appears to have been missing
      any context tracking at all before this patch.
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      95927475
    • A
      x86, entry: Switch stacks on a paranoid entry from userspace · 48e08d0f
      Andy Lutomirski 提交于
      This causes all non-NMI, non-double-fault kernel entries from
      userspace to run on the normal kernel stack.  Double-fault is
      exempt to minimize confusion if we double-fault directly from
      userspace due to a bad kernel stack.
      
      This is, suprisingly, simpler and shorter than the current code.  It
      removes the IMO rather frightening paranoid_userspace path, and it
      make sync_regs much simpler.
      
      There is no risk of stack overflow due to this change -- the kernel
      stack that we switch to is empty.
      
      This will also enable us to create non-atomic sections within
      machine checks from userspace, which will simplify memory failure
      handling.  It will also allow the upcoming fsgsbase code to be
      simplified, because it doesn't need to worry about usergs when
      scheduling in paranoid_exit, as that code no longer exists.
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Acked-by: NBorislav Petkov <bp@alien8.de>
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      48e08d0f
  15. 08 12月, 2014 1 次提交
  16. 25 11月, 2014 1 次提交
  17. 24 11月, 2014 3 次提交
    • A
      x86_64, traps: Rework bad_iret · b645af2d
      Andy Lutomirski 提交于
      It's possible for iretq to userspace to fail.  This can happen because
      of a bad CS, SS, or RIP.
      
      Historically, we've handled it by fixing up an exception from iretq to
      land at bad_iret, which pretends that the failed iret frame was really
      the hardware part of #GP(0) from userspace.  To make this work, there's
      an extra fixup to fudge the gs base into a usable state.
      
      This is suboptimal because it loses the original exception.  It's also
      buggy because there's no guarantee that we were on the kernel stack to
      begin with.  For example, if the failing iret happened on return from an
      NMI, then we'll end up executing general_protection on the NMI stack.
      This is bad for several reasons, the most immediate of which is that
      general_protection, as a non-paranoid idtentry, will try to deliver
      signals and/or schedule from the wrong stack.
      
      This patch throws out bad_iret entirely.  As a replacement, it augments
      the existing swapgs fudge into a full-blown iret fixup, mostly written
      in C.  It's should be clearer and more correct.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b645af2d
    • A
      x86_64, traps: Stop using IST for #SS · 6f442be2
      Andy Lutomirski 提交于
      On a 32-bit kernel, this has no effect, since there are no IST stacks.
      
      On a 64-bit kernel, #SS can only happen in user code, on a failed iret
      to user space, a canonical violation on access via RSP or RBP, or a
      genuine stack segment violation in 32-bit kernel code.  The first two
      cases don't need IST, and the latter two cases are unlikely fatal bugs,
      and promoting them to double faults would be fine.
      
      This fixes a bug in which the espfix64 code mishandles a stack segment
      violation.
      
      This saves 4k of memory per CPU and a tiny bit of code.
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f442be2
    • A
      x86_64, traps: Fix the espfix64 #DF fixup and rewrite it in C · af726f21
      Andy Lutomirski 提交于
      There's nothing special enough about the espfix64 double fault fixup to
      justify writing it in assembly.  Move it to C.
      
      This also fixes a bug: if the double fault came from an IST stack, the
      old asm code would return to a partially uninitialized stack frame.
      
      Fixes: 3891a04aSigned-off-by: NAndy Lutomirski <luto@amacapital.net>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af726f21
  18. 18 11月, 2014 1 次提交
    • D
      x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen 提交于
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to provide enable or disable the
      management of bounds tables in kernel, including on-demand kernel
      allocation (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel can not simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      require kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
      want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      They are similar conceptually to a page fault and will be raised by
      the MPX hardware during both bounds violations or when the tables
      are not present. This patch handles those #BR exceptions for
      not-present tables by carving the space out of the normal processes
      address space (essentially calling the new mmap() interface indroduced
      earlier in this patch set.) and then pointing the bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
         single process's pre-popualated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fe3d197f
  19. 14 6月, 2014 1 次提交
  20. 14 5月, 2014 7 次提交
  21. 06 5月, 2014 1 次提交
  22. 24 4月, 2014 3 次提交
    • M
      kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotation · 9326638c
      Masami Hiramatsu 提交于
      Use NOKPROBE_SYMBOL macro for protecting functions
      from kprobes instead of __kprobes annotation under
      arch/x86.
      
      This applies nokprobe_inline annotation for some cases,
      because NOKPROBE_SYMBOL() will inhibit inlining by
      referring the symbol address.
      
      This just folds a bunch of previous NOKPROBE_SYMBOL()
      cleanup patches for x86 to one patch.
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Lebon <jlebon@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matt Fleming <matt.fleming@intel.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9326638c
    • M
      kprobes, x86: Call exception_enter after kprobes handled · ecd50f71
      Masami Hiramatsu 提交于
      Move exception_enter() call after kprobes handler
      is done. Since the exception_enter() involves
      many other functions (like printk), it can cause
      recursive int3/break loop when kprobes probe such
      functions.
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Link: http://lkml.kernel.org/r/20140417081740.26341.10894.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ecd50f71
    • M
      kprobes/x86: Call exception handlers directly from do_int3/do_debug · 6f6343f5
      Masami Hiramatsu 提交于
      To avoid a kernel crash by probing on lockdep code, call
      kprobe_int3_handler() and kprobe_debug_handler()(which was
      formerly called post_kprobe_handler()) directly from
      do_int3 and do_debug.
      
      Currently kprobes uses notify_die() to hook the int3/debug
      exceptoins. Since there is a locking code in notify_die,
      the lockdep code can be invoked. And because the lockdep
      involves printk() related things, theoretically, we need to
      prohibit probing on such code, which means much longer blacklist
      we'll have. Instead, hooking the int3/debug for kprobes before
      notify_die() can avoid this problem.
      
      Anyway, most of the int3 handlers in the kernel are already
      called from do_int3 directly, e.g. ftrace_int3_handler,
      poke_int3_handler, kgdb_ll_trap. Actually only
      kprobe_exceptions_notify is on the notifier_call_chain.
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Jonathan Lebon <jlebon@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seiji Aguchi <seiji.aguchi@hds.com>
      Link: http://lkml.kernel.org/r/20140417081733.26341.24423.stgit@ltc230.yrl.intra.hitachi.co.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6f6343f5
  23. 12 12月, 2013 1 次提交
  24. 13 11月, 2013 1 次提交