1. 25 Mar 2015, 8 commits
  2. 23 Mar 2015, 1 commit
  3. 17 Mar 2015, 3 commits
    • x86/asm/entry/64: Rename 'old_rsp' to 'rsp_scratch' · c38e5038
      Committed by Ingo Molnar
      Make clear that the usage of PER_CPU(old_rsp) is purely temporary,
      by renaming it to 'rsp_scratch'.
      
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/asm/entry/64: Update comments about stack frames · 7fcb3bc3
      Committed by Ingo Molnar
      Tweak a few outdated comments that were obsoleted by recent changes
      to syscall entry code:
      
       - we no longer have a "partial stack frame" on
         entry, ever.
      
       - explain the syscall entry usage of old_rsp.
      
      Partially based on (split out of) a patch from Denys Vlasenko.
      
      Originally-from: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/asm/entry/64: Enable interrupts *after* we fetch PER_CPU_VAR(old_rsp) · 33db1fd4
      Committed by Denys Vlasenko
      We want to use PER_CPU_VAR(old_rsp) as a simple temporary register,
      to shuffle the user-space RSP into (and back out of) while we set up
      the system call stack frame. At that point we cannot shuffle values
      into general purpose registers, because we have not saved them yet.

      To be able to do this shuffling into a memory location, we must not
      be preempted or interrupted while we do it, otherwise the 'temporary'
      register gets overwritten by some other task's temporary register
      contents (an illustrative C analogue of this constraint follows this
      entry).
      Tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1426600344-8254-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
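
      An illustrative C analogue of the constraint above, not the actual
      entry_64.S assembly: a per-CPU scratch slot is only safe to use inside
      a window in which the CPU cannot take an interrupt, because another
      task entering the kernel on the same CPU would otherwise overwrite it.
      DEFINE_PER_CPU(), this_cpu_write()/this_cpu_read() and
      local_irq_save()/local_irq_restore() are standard kernel primitives;
      rsp_scratch_demo and use_scratch_window() are hypothetical names.

      #include <linux/percpu.h>
      #include <linux/irqflags.h>

      /* Hypothetical stand-in for the real rsp_scratch per-CPU slot. */
      static DEFINE_PER_CPU(unsigned long, rsp_scratch_demo);

      static unsigned long use_scratch_window(unsigned long user_rsp)
      {
      	unsigned long flags, saved;

      	local_irq_save(flags);                      /* no interrupts, hence no preemption */
      	this_cpu_write(rsp_scratch_demo, user_rsp); /* park the value in the scratch slot */

      	/* ... switch stacks, save general purpose registers ... */

      	saved = this_cpu_read(rsp_scratch_demo);    /* still our value: nothing else ran on this CPU */
      	local_irq_restore(flags);                   /* safe to be interrupted again */

      	return saved;
      }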
  4. 10 Mar 2015, 3 commits
  5. 06 Mar 2015, 2 commits
  6. 05 Mar 2015, 13 commits
  7. 24 Feb 2015, 1 commit
    • x86/xen: allow privcmd hypercalls to be preempted · fdfd811d
      Committed by David Vrabel
      Hypercalls submitted by user space tools via the privcmd driver can
      take a long time (potentially many tens of seconds) if the hypercall
      has many sub-operations.

      A fully preemptible kernel may deschedule such a task in any upcall
      called from a hypercall continuation.
      
      However, in a kernel with voluntary or no preemption, hypercall
      continuations in Xen allow event handlers to be run but the task
      issuing the hypercall will not be descheduled until the hypercall is
      complete and the ioctl returns to user space.  These long running
      tasks may also trigger the kernel's soft lockup detection.
      
      Add xen_preemptible_hcall_begin() and xen_preemptible_hcall_end() to
      bracket hypercalls that may be preempted, and use these in the privcmd
      driver (an illustrative usage sketch follows this entry).
      
      When returning from an upcall, call xen_maybe_preempt_hcall(), which
      adds a schedule point if the current task was within a preemptible
      hypercall.
      
      Since _cond_resched() can move the task to a different CPU, clear and
      set xen_in_preemptible_hcall around the call.
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
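
      An illustrative sketch of how the bracketing API added here might be
      used around a long-running privcmd-style hypercall.
      xen_preemptible_hcall_begin() and xen_preemptible_hcall_end() are the
      helpers introduced by this commit; do_long_hypercall() and
      privcmd_like_ioctl() are hypothetical stand-ins, not the real privcmd
      code.

      #include <linux/compiler.h>
      #include <xen/xen-ops.h>

      /* Hypothetical multi-sub-op hypercall issued on behalf of user space. */
      static long do_long_hypercall(void __user *udata);

      static long privcmd_like_ioctl(void __user *udata)
      {
      	long ret;

      	xen_preemptible_hcall_begin();  /* upcalls may now reschedule this task */
      	ret = do_long_hypercall(udata); /* potentially many seconds of sub-ops  */
      	xen_preemptible_hcall_end();    /* leave the preemptible window         */

      	return ret;
      }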
  8. 01 Feb 2015, 2 commits
    • x86_64, entry: Remove the syscall exit audit and schedule optimizations · 96b6352c
      Committed by Andy Lutomirski
      We used to optimize rescheduling and audit on syscall exit.  Now
      that the full slow path is reasonably fast, remove these
      optimizations.  Syscall exit auditing is now handled exclusively by
      syscall_trace_leave.
      
      This adds something like 10ns to the previously optimized paths on
      my computer, presumably due mostly to SAVE_REST / RESTORE_REST.
      
      I think that we should eventually replace both the syscall and
      non-paranoid interrupt exit slow paths with a pair of C functions
      along the lines of the syscall entry hooks.
      
      Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
    • x86_64, entry: Use sysret to return to userspace when possible · 2a23c6b8
      Committed by Andy Lutomirski
      The x86_64 entry code currently jumps through complex and
      inconsistent hoops to try to minimize the impact of syscall exit
      work.  For a true fast-path syscall, almost nothing needs to be
      done, so returning is just a check for exit work and sysret.  For a
      full slow-path return from a syscall, the C exit hook is invoked if
      needed and we join the iret path.
      
      Using iret to return to userspace is very slow, so the entry code
      has accumulated various special cases to try to do certain forms of
      exit work without invoking iret.  This is error-prone, since it
      duplicates assembly code paths, and it's dangerous, since sysret
      can malfunction in interesting ways if used carelessly.  It's
      also inefficient, since a lot of useful cases aren't optimized
      and therefore force an iret out of a combination of paranoia and
      the fact that no one has bothered to write even more asm code
      to avoid it.
      
      I would argue that this approach is backwards.  Rather than trying
      to avoid the iret path, we should instead try to make the iret path
      fast.  Under a specific set of conditions, iret is unnecessary.  In
      particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
      set, and both SS and CS are as expected, then
      'movq 32(%rsp), %rsp; sysret' does the same thing as iret.  This set
      of conditions is nearly always satisfied on return from syscalls, and
      it can even occasionally be satisfied on return from an irq.  (An
      illustrative C sketch of this eligibility check follows this entry.)
      
      Even with the careful checks for sysret applicability, this cuts
      nearly 80ns off of the overhead from syscalls with unoptimized exit
      work.  This includes tracing and context tracking, and any return
      that invokes KVM's user return notifier.  For example, the cost of
      getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
      ~280ns on my computer.
      
      This may allow the removal and even eventual conversion to C
      of a respectable amount of exit asm.
      
      This may require further tweaking to give the full benefit on Xen.
      
      It may be worthwhile to adjust signal delivery and exec to try to
      hit the sysret path.
      
      This does not optimize returns to 32-bit userspace.  Making the same
      optimization for CS == __USER32_CS is conceptually straightforward,
      but it will require some tedious code to handle the differences
      between sysretl and sysexitl.
      
      Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
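
      An illustrative user-space C sketch of the sysret eligibility check
      described above. The struct and helper names are made up for
      illustration (the real test is done in assembly on the saved
      registers), and the selector constants are the conventional Linux
      x86-64 user CS/SS values, stated here as an assumption.

      #include <stdbool.h>
      #include <stdint.h>

      struct saved_regs {                     /* illustrative subset of pt_regs */
      	uint64_t rip, rcx, r11, rflags, cs, ss;
      };

      #define X86_EFLAGS_RF 0x10000UL         /* Resume Flag */
      #define USER_CS       0x33UL            /* assumed 64-bit user code selector */
      #define USER_SS       0x2bUL            /* assumed user data/stack selector  */

      /* Canonical for 48-bit addressing: bits 63..48 are copies of bit 47. */
      static bool is_canonical(uint64_t addr)
      {
      	return ((int64_t)(addr << 16) >> 16) == (int64_t)addr;
      }

      /* True when 'movq 32(%rsp), %rsp; sysret' would behave exactly like iret. */
      static bool sysret_ok(const struct saved_regs *r)
      {
      	return r->rip == r->rcx &&              /* sysret loads RIP from RCX    */
      	       r->rflags == r->r11 &&           /* sysret loads RFLAGS from R11 */
      	       is_canonical(r->rip) &&          /* non-canonical RIP would #GP  */
      	       !(r->rflags & X86_EFLAGS_RF) &&  /* sysret cannot preserve RF    */
      	       r->cs == USER_CS && r->ss == USER_SS;
      }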
  9. 17 Jan 2015, 1 commit
    • x86_64 entry: Fix RCX for ptraced syscalls · 0fcedc86
      Committed by Andy Lutomirski
      The int_ret_from_sys_call and syscall tracing code disagree
      with the sysret path as to the value of RCX.
      
      The Intel SDM, the AMD APM, and my laptop all agree that sysret
      returns with RCX == RIP.  The syscall tracing code does not
      respect this property.
      
      For example, this program:
      
      #include <stdio.h>

      int main()
      {
      	extern const char syscall_rip[];
      	unsigned long rcx = 1;
      	unsigned long orig_rcx = rcx;
      	asm ("mov $-1, %%eax\n\t"
      	     "syscall\n\t"
      	     "syscall_rip:"
      	     : "+c" (rcx) : : "r11");
      	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
      	       rcx, (unsigned long)syscall_rip, orig_rcx);
      	return 0;
      }
      
      prints:
      
        syscall: RCX = 400556  RIP = 400556  orig RCX = 1
      
      Running it under strace gives this instead:
      
        syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1
      
      This changes FIXUP_TOP_OF_STACK to match sysret, causing the
      test to show RCX == RIP even under strace.
      
      It looks like this is a partial revert of:
      88e4bc32686e ("[PATCH] x86-64 architecture specific sync for 2.5.8")
      from the historic git tree.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/c9a418c3dc3993cb88bb7773800225fd318a4c67.1421453410.git.luto@amacapital.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 14 Jan 2015, 2 commits
  11. 03 Jan 2015, 1 commit
    • x86, entry: Switch stacks on a paranoid entry from userspace · 48e08d0f
      Committed by Andy Lutomirski
      This causes all non-NMI, non-double-fault kernel entries from
      userspace to run on the normal kernel stack.  Double-fault is
      exempt to minimize confusion if we double-fault directly from
      userspace due to a bad kernel stack.
      
      This is, surprisingly, simpler and shorter than the current code.  It
      removes the IMO rather frightening paranoid_userspace path, and it
      makes sync_regs much simpler (an illustrative sketch of the simplified
      shape follows this entry).
      
      There is no risk of stack overflow due to this change -- the kernel
      stack that we switch to is empty.
      
      This will also enable us to create non-atomic sections within
      machine checks from userspace, which will simplify memory failure
      handling.  It will also allow the upcoming fsgsbase code to be
      simplified, because it doesn't need to worry about usergs when
      scheduling in paranoid_exit, as that code no longer exists.
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Acked-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
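
      An illustrative sketch of the simplified shape sync_regs() can take
      once every non-NMI, non-double-fault user-mode entry already arrives
      on the normal kernel stack: copy the exception frame to the task's
      top-of-stack pt_regs slot and hand that pointer back to the assembly
      caller. This is a sketch of the idea, not a verbatim copy of the
      patch; task_pt_regs() and struct pt_regs are standard kernel
      definitions, sync_regs_sketch() is a placeholder name.

      #include <linux/sched.h>
      #include <asm/processor.h>
      #include <asm/ptrace.h>

      struct pt_regs *sync_regs_sketch(struct pt_regs *eregs)
      {
      	/* Register frame location at the top of the normal kernel stack. */
      	struct pt_regs *regs = task_pt_regs(current);

      	*regs = *eregs;         /* relocate the frame                      */
      	return regs;            /* the assembly caller switches RSP to it  */
      }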
  12. 16 Dec 2014, 1 commit
    • x86: Avoid building unused IRQ entry stubs · 2414e021
      Committed by Jan Beulich
      When X86_LOCAL_APIC is enabled (i.e. unconditionally on x86-64),
      first_system_vector will never end up being higher than
      LOCAL_TIMER_VECTOR (0xef), and hence building stubs for vectors
      0xef...0xff pointlessly reduces code density. Deal with this at
      build time.
      
      Taking into consideration that X86_64 implies X86_LOCAL_APIC, also
      simplify some #if-s in arch/x86/kernel/irqinit.c, making them easier
      to read and more consistent with the change done here.
      
      While we could further improve the packing of the IRQ entry stubs (the
      four ones now left in the last set could be fit into the four padding
      bytes each of the final four sets have), this doesn't seem to provide
      any real benefit: with both irq_entries_start and common_interrupt
      being cache-line aligned, eliminating the 30th set would just produce
      32 bytes of padding between the 29th set and common_interrupt.
      
      [ tglx: Folded lguest fix from Dan Carpenter ]
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: lguest@lists.ozlabs.org
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Link: http://lkml.kernel.org/r/54574D5F0200007800044389@mail.emea.novell.com
      Link: http://lkml.kernel.org/r/20141115185718.GB6530@mwanda
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 14 Dec 2014, 1 commit
    • x86: hook up execveat system call · 27d6ec7a
      Committed by David Drysdale
      Hook up the x86-64, i386 and x32 ABIs (an illustrative user-space usage
      sketch follows this entry).
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
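
      An illustrative user-space sketch of the syscall being wired up here,
      invoked through the raw syscall(2) interface (glibc had no execveat()
      wrapper at the time). It assumes __NR_execveat is exposed by the
      installed kernel headers; error handling is minimal.

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main(void)
      {
      	/* O_PATH: we only need a handle to the file, not read access. */
      	int fd = open("/bin/echo", O_PATH | O_CLOEXEC);
      	char *argv[] = { "echo", "hello from execveat", NULL };
      	char *envp[] = { NULL };

      	if (fd < 0) {
      		perror("open");
      		return 1;
      	}

      	/* Empty pathname plus AT_EMPTY_PATH: execute the file behind fd itself. */
      	syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);

      	perror("execveat");     /* only reached if the call failed */
      	return 1;
      }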
  14. 24 Nov 2014, 1 commit
    • x86_64, traps: Rework bad_iret · b645af2d
      Committed by Andy Lutomirski
      It's possible for iretq to userspace to fail.  This can happen because
      of a bad CS, SS, or RIP.
      
      Historically, we've handled it by fixing up an exception from iretq to
      land at bad_iret, which pretends that the failed iret frame was really
      the hardware part of #GP(0) from userspace.  To make this work, there's
      an extra fixup to fudge the gs base into a usable state.
      
      This is suboptimal because it loses the original exception.  It's also
      buggy because there's no guarantee that we were on the kernel stack to
      begin with.  For example, if the failing iret happened on return from an
      NMI, then we'll end up executing general_protection on the NMI stack.
      This is bad for several reasons, the most immediate of which is that
      general_protection, as a non-paranoid idtentry, will try to deliver
      signals and/or schedule from the wrong stack.
      
      This patch throws out bad_iret entirely.  As a replacement, it augments
      the existing swapgs fudge into a full-blown iret fixup, mostly written
      in C.  It should be clearer and more correct.
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>