1. 22 7月, 2014 1 次提交
    • S
      x86_32, entry: Store badsys error code in %eax · 8142b215
      Sven Wegener 提交于
      Commit 554086d8 ("x86_32, entry: Do syscall exit work on badsys
      (CVE-2014-4508)") introduced a regression in the x86_32 syscall entry
      code, resulting in syscall() not returning proper errors for undefined
      syscalls on CPUs supporting the sysenter feature.
      
      The following code:
      
      > int result = syscall(666);
      > printf("result=%d errno=%d error=%s\n", result, errno, strerror(errno));
      
      results in:
      
      > result=666 errno=0 error=Success
      
      Obviously, the syscall return value is the called syscall number, but it
      should have been an ENOSYS error. When run under ptrace it behaves
      correctly, which makes it hard to debug in the wild:
      
      > result=-1 errno=38 error=Function not implemented
      
      The %eax register is the return value register. For debugging via ptrace
      the syscall entry code stores the complete register context on the
      stack. The badsys handlers only store the ENOSYS error code in the
      ptrace register set and do not set %eax like a regular syscall handler
      would. The old resume_userspace call chain contains code that clobbers
      %eax and it restores %eax from the ptrace registers afterwards. The same
      goes for the ptrace-enabled call chain. When ptrace is not used, the
      syscall return value is the passed-in syscall number from the untouched
      %eax register.
      
      Use %eax as the return value register in syscall_badsys and
      sysenter_badsys, like a real syscall handler does, and have the caller
      push the value onto the stack for ptrace access.
      Signed-off-by: NSven Wegener <sven.wegener@stealer.net>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.11.1407221022380.31021@titan.int.lan.stealer.netReviewed-and-tested-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: <stable@vger.kernel.org> # If 554086d8 is backported
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      8142b215
  2. 24 6月, 2014 1 次提交
  3. 05 5月, 2014 1 次提交
  4. 01 5月, 2014 1 次提交
  5. 24 4月, 2014 1 次提交
  6. 10 1月, 2014 1 次提交
  7. 09 11月, 2013 1 次提交
  8. 25 9月, 2013 1 次提交
  9. 05 9月, 2013 1 次提交
  10. 21 6月, 2013 1 次提交
    • S
      x86, trace: Add irq vector tracepoints · cf910e83
      Seiji Aguchi 提交于
      [Purpose of this patch]
      
      As Vaibhav explained in the thread below, tracepoints for irq vectors
      are useful.
      
      http://www.spinics.net/lists/mm-commits/msg85707.html
      
      <snip>
      The current interrupt traces from irq_handler_entry and irq_handler_exit
      provide when an interrupt is handled.  They provide good data about when
      the system has switched to kernel space and how it affects the currently
      running processes.
      
      There are some IRQ vectors which trigger the system into kernel space,
      which are not handled in generic IRQ handlers.  Tracing such events gives
      us the information about IRQ interaction with other system events.
      
      The trace also tells where the system is spending its time.  We want to
      know which cores are handling interrupts and how they are affecting other
      processes in the system.  Also, the trace provides information about when
      the cores are idle and which interrupts are changing that state.
      <snip>
      
      On the other hand, my usecase is tracing just local timer event and
      getting a value of instruction pointer.
      
      I suggested to add an argument local timer event to get instruction pointer before.
      But there is another way to get it with external module like systemtap.
      So, I don't need to add any argument to irq vector tracepoints now.
      
      [Patch Description]
      
      Vaibhav's patch shared a trace point ,irq_vector_entry/irq_vector_exit, in all events.
      But there is an above use case to trace specific irq_vector rather than tracing all events.
      In this case, we are concerned about overhead due to unwanted events.
      
      So, add following tracepoints instead of introducing irq_vector_entry/exit.
      so that we can enable them independently.
         - local_timer_vector
         - reschedule_vector
         - call_function_vector
         - call_function_single_vector
         - irq_work_entry_vector
         - error_apic_vector
         - thermal_apic_vector
         - threshold_apic_vector
         - spurious_apic_vector
         - x86_platform_ipi_vector
      
      Also, introduce a logic switching IDT at enabling/disabling time so that a time penalty
      makes a zero when tracepoints are disabled. Detailed explanations are as follows.
       - Create trace irq handlers with entering_irq()/exiting_irq().
       - Create a new IDT, trace_idt_table, at boot time by adding a logic to
         _set_gate(). It is just a copy of original idt table.
       - Register the new handlers for tracpoints to the new IDT by introducing
         macros to alloc_intr_gate() called at registering time of irq_vector handlers.
       - Add checking, whether irq vector tracing is on/off, into load_current_idt().
         This has to be done below debug checking for these reasons.
         - Switching to debug IDT may be kicked while tracing is enabled.
         - On the other hands, switching to trace IDT is kicked only when debugging
           is disabled.
      
      In addition, the new IDT is created only when CONFIG_TRACING is enabled to avoid being
      used for other purposes.
      Signed-off-by: NSeiji Aguchi <seiji.aguchi@hds.com>
      Link: http://lkml.kernel.org/r/51C323ED.5050708@hds.comSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      cf910e83
  11. 13 2月, 2013 1 次提交
  12. 04 2月, 2013 3 次提交
  13. 17 1月, 2013 1 次提交
    • A
      xen: Fix stack corruption in xen_failsafe_callback for 32bit PVOPS guests. · 9174adbe
      Andrew Cooper 提交于
      This fixes CVE-2013-0190 / XSA-40
      
      There has been an error on the xen_failsafe_callback path for failed
      iret, which causes the stack pointer to be wrong when entering the
      iret_exc error path.  This can result in the kernel crashing.
      
      In the classic kernel case, the relevant code looked a little like:
      
              popl %eax      # Error code from hypervisor
              jz 5f
              addl $16,%esp
              jmp iret_exc   # Hypervisor said iret fault
      5:      addl $16,%esp
                             # Hypervisor said segment selector fault
      
      Here, there are two identical addls on either option of a branch which
      appears to have been optimised by hoisting it above the jz, and
      converting it to an lea, which leaves the flags register unaffected.
      
      In the PVOPS case, the code looks like:
      
              popl_cfi %eax         # Error from the hypervisor
              lea 16(%esp),%esp     # Add $16 before choosing fault path
              CFI_ADJUST_CFA_OFFSET -16
              jz 5f
              addl $16,%esp         # Incorrectly adjust %esp again
              jmp iret_exc
      
      It is possible unprivileged userspace applications to cause this
      behaviour, for example by loading an LDT code selector, then changing
      the code selector to be not-present.  At this point, there is a race
      condition where it is possible for the hypervisor to return back to
      userspace from an interrupt, fault on its own iret, and inject a
      failsafe_callback into the kernel.
      
      This bug has been present since the introduction of Xen PVOPS support
      in commit 5ead97c8 (xen: Core Xen implementation), in 2.6.23.
      Signed-off-by: NFrediano Ziglio <frediano.ziglio@citrix.com>
      Signed-off-by: NAndrew Cooper <andrew.cooper3@citrix.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      9174adbe
  14. 20 12月, 2012 1 次提交
  15. 29 11月, 2012 1 次提交
  16. 20 10月, 2012 1 次提交
    • D
      xen/x86: don't corrupt %eip when returning from a signal handler · a349e23d
      David Vrabel 提交于
      In 32 bit guests, if a userspace process has %eax == -ERESTARTSYS
      (-512) or -ERESTARTNOINTR (-513) when it is interrupted by an event
      /and/ the process has a pending signal then %eip (and %eax) are
      corrupted when returning to the main process after handling the
      signal.  The application may then crash with SIGSEGV or a SIGILL or it
      may have subtly incorrect behaviour (depending on what instruction it
      returned to).
      
      The occurs because handle_signal() is incorrectly thinking that there
      is a system call that needs to restarted so it adjusts %eip and %eax
      to re-execute the system call instruction (even though user space had
      not done a system call).
      
      If %eax == -514 (-ERESTARTNOHAND (-514) or -ERESTART_RESTARTBLOCK
      (-516) then handle_signal() only corrupted %eax (by setting it to
      -EINTR).  This may cause the application to crash or have incorrect
      behaviour.
      
      handle_signal() assumes that regs->orig_ax >= 0 means a system call so
      any kernel entry point that is not for a system call must push a
      negative value for orig_ax.  For example, for physical interrupts on
      bare metal the inverse of the vector is pushed and page_fault() sets
      regs->orig_ax to -1, overwriting the hardware provided error code.
      
      xen_hypervisor_callback() was incorrectly pushing 0 for orig_ax
      instead of -1.
      
      Classic Xen kernels pushed %eax which works as %eax cannot be both
      non-negative and -RESTARTSYS (etc.), but using -1 is consistent with
      other non-system call entry points and avoids some of the tests in
      handle_signal().
      
      There were similar bugs in xen_failsafe_callback() of both 32 and
      64-bit guests. If the fault was corrected and the normal return path
      was used then 0 was incorrectly pushed as the value for orig_ax.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NJan Beulich <JBeulich@suse.com>
      Acked-by: NIan Campbell <ian.campbell@citrix.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      a349e23d
  17. 13 10月, 2012 1 次提交
  18. 01 10月, 2012 3 次提交
  19. 22 9月, 2012 1 次提交
  20. 14 9月, 2012 1 次提交
  21. 31 7月, 2012 2 次提交
  22. 20 7月, 2012 3 次提交
  23. 02 6月, 2012 1 次提交
    • A
      x86: get rid of calling do_notify_resume() when returning to kernel mode · 44fbbb3d
      Al Viro 提交于
      If we end up calling do_notify_resume() with !user_mode(refs), it
      does nothing (do_signal() explicitly bails out and we can't get there
      with TIF_NOTIFY_RESUME in such situations).  Then we jump to
      resume_userspace_sig, which rechecks the same thing and bails out
      to resume_kernel, thus breaking the loop.
      
      It's easier and cheaper to check *before* calling do_notify_resume()
      and bail out to resume_kernel immediately.  And kill the check in
      do_signal()...
      
      Note that on amd64 we can't get there with !user_mode() at all - asm
      glue takes care of that.
      Acked-and-reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      44fbbb3d
  24. 21 4月, 2012 1 次提交
  25. 23 3月, 2012 1 次提交
    • D
      x86-32: Fix endless loop when processing signals for kernel tasks · 29a2e283
      Dmitry Adamushko 提交于
      The problem occurs on !CONFIG_VM86 kernels [1] when a kernel-mode task
      returns from a system call with a pending signal.
      
      A real-life scenario is a child of 'khelper' returning from a failed
      kernel_execve() in ____call_usermodehelper() [ kernel/kmod.c ].
      kernel_execve() fails due to a pending SIGKILL, which is the result of
      "kill -9 -1" (at least, busybox's init does it upon reboot).
      
      The loop is as follows:
      
      * syscall_exit_work:
       - work_pending:            // start_of_the_loop
       - work_notify_sig:
         - do_notify_resume()
           - do_signal()
             - if (!user_mode(regs)) return;
       - resume_userspace         // TIF_SIGPENDING is still set
       - work_pending             // so we call work_pending => goto
                                  // start_of_the_loop
      
      More information can be found in another LKML thread:
      http://www.serverphorums.com/read.php?12,457826
      
      [1] the problem was also seen on MIPS.
      Signed-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
      Link: http://lkml.kernel.org/r/1332448765.2299.68.camel@dimm
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      29a2e283
  26. 18 1月, 2012 2 次提交
    • E
      audit: inline audit_syscall_entry to reduce burden on archs · b05d8447
      Eric Paris 提交于
      Every arch calls:
      
      if (unlikely(current->audit_context))
      	audit_syscall_entry()
      
      which requires knowledge about audit (the existance of audit_context) in
      the arch code.  Just do it all in static inline in audit.h so that arch's
      can remain blissfully ignorant.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      b05d8447
    • E
      Audit: push audit success and retcode into arch ptrace.h · d7e7528b
      Eric Paris 提交于
      The audit system previously expected arches calling to audit_syscall_exit to
      supply as arguments if the syscall was a success and what the return code was.
      Audit also provides a helper AUDITSC_RESULT which was supposed to simplify things
      by converting from negative retcodes to an audit internal magic value stating
      success or failure.  This helper was wrong and could indicate that a valid
      pointer returned to userspace was a failed syscall.  The fix is to fix the
      layering foolishness.  We now pass audit_syscall_exit a struct pt_reg and it
      in turns calls back into arch code to collect the return value and to
      determine if the syscall was a success or failure.  We also define a generic
      is_syscall_success() macro which determines success/failure based on if the
      value is < -MAX_ERRNO.  This works for arches like x86 which do not use a
      separate mechanism to indicate syscall failure.
      
      We make both the is_syscall_success() and regs_return_value() static inlines
      instead of macros.  The reason is because the audit function must take a void*
      for the regs.  (uml calls theirs struct uml_pt_regs instead of just struct
      pt_regs so audit_syscall_exit can't take a struct pt_regs).  Since the audit
      function takes a void* we need to use static inlines to cast it back to the
      arch correct structure to dereference it.
      
      The other major change is that on some arches, like ia64, MIPS and ppc, we
      change regs_return_value() to give us the negative value on syscall failure.
      THE only other user of this macro, kretprobe_example.c, won't notice and it
      makes the value signed consistently for the audit functions across all archs.
      
      In arch/sh/kernel/ptrace_64.c I see that we were using regs[9] in the old
      audit code as the return value.  But the ptrace_64.h code defined the macro
      regs_return_value() as regs[3].  I have no idea which one is correct, but this
      patch now uses the regs_return_value() function, so it now uses regs[3].
      
      For powerpc we previously used regs->result but now use the
      regs_return_value() function which uses regs->gprs[3].  regs->gprs[3] is
      always positive so the regs_return_value(), much like ia64 makes it negative
      before calling the audit code when appropriate.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com> [for x86 portion]
      Acked-by: Tony Luck <tony.luck@intel.com> [for ia64]
      Acked-by: Richard Weinberger <richard@nod.at> [for uml]
      Acked-by: David S. Miller <davem@davemloft.net> [for sparc]
      Acked-by: Ralf Baechle <ralf@linux-mips.org> [for mips]
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> [for ppc]
      d7e7528b
  27. 06 12月, 2011 1 次提交
  28. 18 11月, 2011 1 次提交
    • H
      x86: Generate system call tables and unistd_*.h from tables · 303395ac
      H. Peter Anvin 提交于
      Generate system call tables and unistd_*.h automatically from the
      tables in arch/x86/syscalls.  All other information, like NR_syscalls,
      is auto-generated, some of which is in asm-offsets_*.c.
      
      This allows us to keep all the system call information in one place,
      and allows for kernel space and user space to see different
      information; this is currently used for the ia32 system call numbers
      when building the 64-bit kernel, but will be used by the x32 ABI in
      the near future.
      
      This also removes some gratuitious differences between i386, x86-64
      and ia32; in particular, now all system call tables are generated with
      the same mechanism.
      
      Cc: H. J. Lu <hjl.tools@gmail.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      303395ac
  29. 26 8月, 2011 1 次提交
  30. 09 3月, 2011 2 次提交
    • S
      x86: Fix binutils-2.21 symbol related build failures · 2ae9d293
      Sedat Dilek 提交于
      New binutils version 2.21.0.20110302-1 started checking that the symbol
      parameter to the .size directive matches the entry name's
      symbol parameter, unearthing two mismatches:
      
        AS      arch/x86/kernel/acpi/wakeup_rm.o
        arch/x86/kernel/acpi/wakeup_rm.S: Assembler messages:
        arch/x86/kernel/acpi/wakeup_rm.S:12: Error: .size expression with symbol `wakeup_code_start' does not evaluate to a constant
      
        arch/x86/kernel/entry_32.S: Assembler messages:
        arch/x86/kernel/entry_32.S:1421: Error: .size expression with
        symbol `apf_page_fault' does not evaluate to a constant
      
      The problem was discovered while using Debian's binutils
      (2.21.0.20110302-1) and experimenting with binutils from
      upstream.
      
      Thanks Alexander and H.J. for the vital help.
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      Cc: Alexander van Heukelum <heukelum@fastmail.fm>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      LKML-Reference: <1299620364-21644-1-git-send-email-sedat.dilek@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2ae9d293
    • J
      x86: Separate out entry text section · ea714547
      Jiri Olsa 提交于
      Put x86 entry code into a separate link section: .entry.text.
      
      Separating the entry text section seems to have performance
      benefits - caused by more efficient instruction cache usage.
      
      Running hackbench with perf stat --repeat showed that the change
      compresses the icache footprint. The icache load miss rate went
      down by about 15%:
      
       before patch:
               19417627  L1-icache-load-misses      ( +-   0.147% )
      
       after patch:
               16490788  L1-icache-load-misses      ( +-   0.180% )
      
      The motivation of the patch was to fix a particular kprobes
      bug that relates to the entry text section, the performance
      advantage was discovered accidentally.
      
      Whole perf output follows:
      
       - results for current tip tree:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               19417627  L1-icache-load-misses      ( +-   0.147% )
             2676914223  instructions             #      0.497 IPC     ( +- 0.079% )
             5389516026  cycles                     ( +-   0.144% )
      
            0.206267711  seconds time elapsed   ( +-   0.138% )
      
       - results for current tip tree with the patch applied:
      
        Performance counter stats for './hackbench/hackbench 10' (500 runs):
      
               16490788  L1-icache-load-misses      ( +-   0.180% )
             2717734941  instructions             #      0.502 IPC     ( +- 0.079% )
             5414756975  cycles                     ( +-   0.148% )
      
            0.206747566  seconds time elapsed   ( +-   0.137% )
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: masami.hiramatsu.pt@hitachi.com
      Cc: ananth@in.ibm.com
      Cc: davem@davemloft.net
      Cc: 2nddept-manager@sdl.hitachi.co.jp
      LKML-Reference: <20110307181039.GB15197@jolsa.redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ea714547
  31. 01 3月, 2011 1 次提交