1. 04 Feb 2015, 1 commit
  2. 01 Feb 2015, 3 commits
    • x86_64, entry: Remove the syscall exit audit and schedule optimizations · 96b6352c
      Authored by Andy Lutomirski
      We used to optimize rescheduling and audit on syscall exit.  Now
      that the full slow path is reasonably fast, remove these
      optimizations.  Syscall exit auditing is now handled exclusively by
      syscall_trace_leave.
      
      This adds something like 10ns to the previously optimized paths on
      my computer, presumably due mostly to SAVE_REST / RESTORE_REST.
      
      I think that we should eventually replace both the syscall and
      non-paranoid interrupt exit slow paths with a pair of C functions
      along the lines of the syscall entry hooks.
      
      Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86_64, entry: Use sysret to return to userspace when possible · 2a23c6b8
      Authored by Andy Lutomirski
      The x86_64 entry code currently jumps through complex and
      inconsistent hoops to try to minimize the impact of syscall exit
      work.  For a true fast-path syscall, almost nothing needs to be
      done, so returning is just a check for exit work and sysret.  For a
      full slow-path return from a syscall, the C exit hook is invoked if
      needed and we join the iret path.
      
      Using iret to return to userspace is very slow, so the entry code
      has accumulated various special cases to try to do certain forms of
      exit work without invoking iret.  This is error-prone, since it
      duplicates assembly code paths, and it's dangerous, since sysret
      can malfunction in interesting ways if used carelessly.  It's
      also inefficient, since a lot of useful cases aren't optimized
      and therefore force an iret out of a combination of paranoia and
      the fact that no one has bothered to write even more asm code
      to avoid it.
      
      I would argue that this approach is backwards.  Rather than trying
      to avoid the iret path, we should instead try to make the iret path
      fast.  Under a specific set of conditions, iret is unnecessary.  In
      particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
      set, and both SS and CS are as expected, then
      movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
      conditions is nearly always satisfied on return from syscalls, and
      it can even occasionally be satisfied on return from an irq.
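
      As an illustration, the conditions above amount to a predicate over the
      saved user register frame.  The sketch below is not the kernel's code:
      the struct, the helper name, and the selector constants are assumptions
      chosen to make the example self-contained.

      #include <stdbool.h>
      #include <stdint.h>

      /* Illustrative subset of the saved user register frame (names assumed). */
      struct saved_regs {
      	uint64_t ip, cx, r11, flags, cs, ss;
      };

      #define FLAG_RF  (1ULL << 16)	/* RFLAGS resume flag */
      #define USER_CS  0x33ULL	/* typical x86_64 user code selector */
      #define USER_SS  0x2bULL	/* typical x86_64 user data selector */

      /* Hypothetical predicate: may we use sysret here instead of iret? */
      static bool sysret_ok(const struct saved_regs *r)
      {
      	/* sysret reloads RIP from RCX and RFLAGS from R11 */
      	if (r->ip != r->cx || r->flags != r->r11)
      		return false;
      	/* a non-canonical RIP would make sysret fault in ring 0 */
      	if ((uint64_t)((int64_t)(r->ip << 16) >> 16) != r->ip)
      		return false;
      	/* iret can honor RF; sysret cannot */
      	if (r->flags & FLAG_RF)
      		return false;
      	/* CS and SS must be the ordinary user selectors */
      	return r->cs == USER_CS && r->ss == USER_SS;
      }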
      
      Even with the careful checks for sysret applicability, this cuts
      nearly 80ns off of the overhead from syscalls with unoptimized exit
      work.  This includes tracing and context tracking, and any return
      that invokes KVM's user return notifier.  For example, the cost of
      getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
      ~280ns on my computer.
      
      This may allow the removal and even eventual conversion to C
      of a respectable amount of exit asm.
      
      This may require further tweaking to give the full benefit on Xen.
      
      It may be worthwhile to adjust signal delivery and exec to try to
      hit the sysret path.
      
      This does not optimize returns to 32-bit userspace.  Making the same
      optimization for CS == __USER32_CS is conceptually straightforward,
      but it will require some tedious code to handle the differences
      between sysretl and sysexitl.
      
      Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86, traps: Fix ist_enter from userspace · b926e6f6
      Authored by Andy Lutomirski
      context_tracking_user_exit() has no effect if in_interrupt() returns true,
      so ist_enter() didn't work.  Fix it by calling exception_enter(), and thus
      context_tracking_user_exit(), before incrementing the preempt count.
      
      This also adds an assertion that will catch the problem reliably if
      CONFIG_PROVE_RCU=y to help prevent the bug from being reintroduced.
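
      A minimal sketch of that ordering, assuming simplified names (this is an
      illustration of the reasoning, not the kernel's exact ist_enter):

      #include <linux/context_tracking.h>
      #include <linux/hardirq.h>
      #include <linux/rcupdate.h>
      #include <asm/ptrace.h>

      void ist_enter_sketch(struct pt_regs *regs)	/* assumed name */
      {
      	if (user_mode_vm(regs))
      		exception_enter();	/* runs context_tracking_user_exit()
      					   while in_interrupt() is still false */
      	else
      		rcu_nmi_enter();	/* kernel-mode IST entry: treat like an NMI */

      	/* Only now mark the context atomic; doing this first would have
      	   made the context-tracking exit above a silent no-op. */
      	preempt_count_add(HARDIRQ_OFFSET);
      }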
      
      Link: http://lkml.kernel.org/r/261ebee6aee55a4724746d0d7024697013c40a08.1422709102.git.luto@amacapital.net
      Fixes: 95927475 ("x86, traps: Track entry into and exit from IST context")
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
  3. 28 Jan 2015, 3 commits
  4. 23 Jan 2015, 2 commits
  5. 20 Jan 2015, 4 commits
  6. 17 Jan 2015, 1 commit
    • x86_64 entry: Fix RCX for ptraced syscalls · 0fcedc86
      Authored by Andy Lutomirski
      The int_ret_from_sys_call and syscall tracing code disagrees
      with the sysret path as to the value of RCX.
      
      The Intel SDM, the AMD APM, and my laptop all agree that sysret
      returns with RCX == RIP.  The syscall tracing code does not
      respect this property.
      
      For example, this program:
      
      #include <stdio.h>

      int main()
      {
      	extern const char syscall_rip[];
      	unsigned long rcx = 1;
      	unsigned long orig_rcx = rcx;
      	asm ("mov $-1, %%eax\n\t"	/* invalid syscall number */
      	     "syscall\n\t"
      	     "syscall_rip:"
      	     : "+c" (rcx) : : "r11", "rax");
      	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
      	       rcx, (unsigned long)syscall_rip, orig_rcx);
      	return 0;
      }
      
      prints:
      
        syscall: RCX = 400556  RIP = 400556  orig RCX = 1
      
      Running it under strace gives this instead:
      
        syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1
      
      This changes FIXUP_TOP_OF_STACK to match sysret, causing the
      test to show RCX == RIP even under strace.
      
      It looks like this is a partial revert of:
      88e4bc32686e ("[PATCH] x86-64 architecture specific sync for 2.5.8")
      from the historic git tree.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/c9a418c3dc3993cb88bb7773800225fd318a4c67.1421453410.git.luto@amacapital.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 16 Jan 2015, 2 commits
  8. 15 Jan 2015, 1 commit
    • ftrace/jprobes/x86: Fix conflict between jprobes and function graph tracing · 237d28db
      Authored by Steven Rostedt (Red Hat)
      If the function graph tracer traces a jprobe callback, the system will
      crash. This can easily be demonstrated by compiling the jprobe
      sample module that is in the kernel tree, loading it and running the
      function graph tracer.
      
       # modprobe jprobe_example.ko
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
       # ls
      
      The first two commands end up in a nice crash after the first fork.
      (do_fork has a jprobe attached to it, so "ls" just triggers that fork)
      
      The problem is caused by the jprobe_return() that all jprobe callbacks
      must end with. The way jprobes works is that the function a jprobe
      is attached to has a breakpoint placed at the start of it (or it uses
      ftrace if fentry is supported). The breakpoint handler (or ftrace callback)
      will copy the stack frame and change the ip address to return to the
      jprobe handler instead of the function. The jprobe handler must end
      with jprobe_return() which swaps the stack and does an int3 (breakpoint).
      This breakpoint handler will then put back the saved stack frame,
      simulate the instruction at the beginning of the function it added
      a breakpoint to, and then continue on.
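
      For reference, a jprobe callback along the lines of the in-tree sample
      looks roughly like this (a sketch for the pre-4.15 jprobes API; the
      do_fork signature and symbol name match kernels of this era, and the
      module scaffolding is an assumption for the example):

      #include <linux/kernel.h>
      #include <linux/module.h>
      #include <linux/kprobes.h>

      /* Mirrors do_fork()'s signature so the saved argument registers line up. */
      static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
      		     unsigned long stack_size, int __user *parent_tidptr,
      		     int __user *child_tidptr)
      {
      	pr_info("jprobe: do_fork(clone_flags=0x%lx)\n", clone_flags);

      	/* Every jprobe handler must end here: jprobe_return() swaps the
      	   stack back and traps; it never returns normally. */
      	jprobe_return();
      	return 0;
      }

      static struct jprobe my_jprobe = {
      	.entry = jdo_fork,
      	.kp = { .symbol_name = "do_fork" },
      };

      static int __init jprobe_sketch_init(void)
      {
      	return register_jprobe(&my_jprobe);
      }

      static void __exit jprobe_sketch_exit(void)
      {
      	unregister_jprobe(&my_jprobe);
      }

      module_init(jprobe_sketch_init);
      module_exit(jprobe_sketch_exit);
      MODULE_LICENSE("GPL");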
      
      For function tracing to work, it hijacks the return address from the
      stack frame, and replaces it with a hook function that will trace
      the end of the call. This hook function will restore the return
      address of the function call.
      
      If the function tracer traces the jprobe handler, the hook function
      for that handler will not be called, and its saved return address
      will be used for the next function. This will result in a kernel crash.
      
      To solve this, pause function tracing before the jprobe handler is called
      and unpause it before it returns back to the function it probed.
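
      Sketched, the shape of that fix is the following (simplified and
      assumed; the real change lives in the x86 kprobes code, using the
      existing pause_graph_tracing() / unpause_graph_tracing() helpers):

      #include <linux/ftrace.h>
      #include <linux/kprobes.h>

      /* Assumed, simplified pre/post handlers around a jprobe dispatch. */
      static int setjmp_pre_handler_sketch(struct kprobe *p, struct pt_regs *regs)
      {
      	struct jprobe *jp = container_of(p, struct jprobe, kp);

      	/* ... save the stack frame and registers as before ... */

      	pause_graph_tracing();	/* keep the graph tracer off the handler */
      	regs->ip = (unsigned long)jp->entry;	/* divert to the jprobe handler */
      	return 1;
      }

      static int longjmp_break_handler_sketch(struct kprobe *p, struct pt_regs *regs)
      {
      	/* ... restore the saved stack frame and registers ... */

      	unpause_graph_tracing();	/* handler finished, resume tracing */
      	return 1;
      }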
      
      Some other updates:
      
      Used a variable "saved_sp" to hold kcb->jprobe_saved_sp. This makes the
      code look a bit cleaner and easier to understand (various tries to fix
      this bug required this change).
      
      Note, if fentry is being used, jprobes will change the ip address before
      the function graph tracer runs and it will not be able to trace the
      function that the jprobe is probing.
      
      Link: http://lkml.kernel.org/r/20150114154329.552437962@goodmis.org
      
      Cc: stable@vger.kernel.org # 2.6.30+
      Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  9. 14 Jan 2015, 2 commits
  10. 09 Jan 2015, 3 commits
  11. 07 Jan 2015, 1 commit
  12. 06 Jan 2015, 1 commit
  13. 03 Jan 2015, 4 commits
    • x86, traps: Add ist_begin_non_atomic and ist_end_non_atomic · bced35b6
      Authored by Andy Lutomirski
      In some IST handlers, if the interrupt came from user mode,
      we can safely enable preemption.  Add helpers to do it safely.
      
      This is intended to be used by the memory failure code in
      do_machine_check.
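
      Sketch of the intended pattern inside such a handler (illustrative
      fragment only; the surrounding code and the sleepy work are assumptions):

      	if (user_mode_vm(regs)) {
      		/* From userspace the IST stack holds nothing else, so
      		   preemption can be re-enabled for work that may sleep. */
      		ist_begin_non_atomic(regs);
      		/* e.g. hand the poisoned page to the memory-failure code */
      		ist_end_non_atomic();
      	}
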
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86: Clean up current_stack_pointer · 83653c16
      Authored by Andy Lutomirski
      There's no good reason for it to be a macro, and x86_64 will want to
      use it, so it should be in a header.
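
      A plausible header form, sketched (an assumption about the shape of the
      result, not a quote of the patch):

      /* In a header: a real inline function instead of a macro. */
      static inline unsigned long current_stack_pointer(void)
      {
      	unsigned long sp;
      #ifdef CONFIG_X86_64
      	asm("mov %%rsp, %0" : "=g" (sp));
      #else
      	asm("mov %%esp, %0" : "=g" (sp));
      #endif
      	return sp;
      }
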
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86, traps: Track entry into and exit from IST context · 95927475
      Authored by Andy Lutomirski
      We currently pretend that IST context is like standard exception
      context, but this is incorrect.  IST entries from userspace are like
      standard exceptions except that they use per-cpu stacks, so they are
      atomic.  IST entries from kernel space are like NMIs from RCU's
      perspective -- they are not quiescent states even if they
      interrupted the kernel during a quiescent state.
      
      Add and use ist_enter and ist_exit to track IST context.  Even
      though x86_32 has no IST stacks, we track these interrupts the same
      way.
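
      Sketched usage (the handler name and the exact prototypes here are
      assumptions; the point is the bracketing):

      /* Every IST-capable trap handler brackets its body with the pair. */
      dotraplinkage void do_sample_trap(struct pt_regs *regs, long error_code)
      {
      	enum ctx_state prev_state = ist_enter(regs);

      	/* Body runs atomically; for a kernel-mode entry, RCU has been told
      	   to watch this CPU exactly as it would for an NMI. */

      	ist_exit(regs, prev_state);
      }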
      
      This fixes two issues:
      
       - Scheduling from an IST interrupt handler will now warn.  It would
         previously appear to work as long as we got lucky and nothing
         overwrote the stack frame.  (I don't know of any bugs in this
         that would trigger the warning, but it's good to be on the safe
         side.)
      
       - RCU handling in IST context was dangerous.  As far as I know,
         only machine checks were likely to trigger this, but it's good to
         be on the safe side.
      
      Note that the machine check handler appears to have been missing
      any context tracking at all before this patch.
      
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86, entry: Switch stacks on a paranoid entry from userspace · 48e08d0f
      Authored by Andy Lutomirski
      This causes all non-NMI, non-double-fault kernel entries from
      userspace to run on the normal kernel stack.  Double-fault is
      exempt to minimize confusion if we double-fault directly from
      userspace due to a bad kernel stack.
      
      This is, surprisingly, simpler and shorter than the current code.  It
      removes the IMO rather frightening paranoid_userspace path, and it
      makes sync_regs much simpler.
      
      There is no risk of stack overflow due to this change -- the kernel
      stack that we switch to is empty.
      
      This will also enable us to create non-atomic sections within
      machine checks from userspace, which will simplify memory failure
      handling.  It will also allow the upcoming fsgsbase code to be
      simplified, because it doesn't need to worry about usergs when
      scheduling in paranoid_exit, as that code no longer exists.
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
  14. 23 Dec 2014, 2 commits
  15. 18 Dec 2014, 1 commit
  16. 16 Dec 2014, 9 commits