1. 26 9月, 2012 1 次提交
    • F
      x86: Exit RCU extended QS on notify resume · edf55fda
      Frederic Weisbecker 提交于
      do_notify_resume() may be called on irq or exception
      exit. But at that time the exception has already called
      rcu_user_enter() and the irq has already called rcu_irq_exit().
      
      Since it can use RCU read side critical section, we must call
      rcu_user_exit() before doing anything there. Then we must call
      back rcu_user_enter() after this function because we know we are
      going to userspace from there.
      
      This complete support for userspace RCU extended quiescent state
      in x86-64.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      edf55fda
  2. 25 9月, 2012 1 次提交
    • F
      cputime: Make finegrained irqtime accounting generally available · fdf9c356
      Frederic Weisbecker 提交于
      There is no known reason for this option to be unavailable on other
      archs than x86. They just need to call enable_sched_clock_irqtime()
      if they have a sufficiently finegrained clock to make it working.
      
      Move it to the general option and let the user choose between
      it and pure tick based or virtual cputime accounting.
      
      Note that virtual cputime accounting already performs a finegrained
      irqtime accounting. CONFIG_IRQ_TIME_ACCOUNTING is a kind of middle ground
      between tick and virtual based accounting. So CONFIG_IRQ_TIME_ACCOUNTING
      and CONFIG_VIRT_CPU_ACCOUNTING are mutually exclusive choices.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      fdf9c356
  3. 05 9月, 2012 1 次提交
  4. 23 8月, 2012 1 次提交
    • S
      ftrace/x86: Add support for -mfentry to x86_64 · d57c5d51
      Steven Rostedt 提交于
      If the kernel is compiled with gcc 4.6.0 which supports -mfentry,
      then use that instead of mcount.
      
      With mcount, frame pointers are forced with the -pg option and we
      get something like:
      
      <can_vma_merge_before>:
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             53                      push   %rbx
             41 51                   push   %r9
             e8 fe 6a 39 00          callq  ffffffff81483d00 <mcount>
             31 c0                   xor    %eax,%eax
             48 89 fb                mov    %rdi,%rbx
             48 89 d7                mov    %rdx,%rdi
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      With -mfentry, frame pointers are no longer forced and the call looks
      like this:
      
      <can_vma_merge_before>:
             e8 33 af 37 00          callq  ffffffff81461b40 <__fentry__>
             53                      push   %rbx
             48 89 fb                mov    %rdi,%rbx
             31 c0                   xor    %eax,%eax
             48 89 d7                mov    %rdx,%rdi
             41 51                   push   %r9
             48 33 73 30             xor    0x30(%rbx),%rsi
             48 f7 c6 ff ff ff f7    test   $0xfffffffff7ffffff,%rsi
      
      This adds the ftrace hook at the beginning of the function before a
      frame is set up, and allows the function callbacks to be able to access
      parameters. As kprobes now can use function tracing (at least on x86)
      this speeds up the kprobe hooks that are at the beginning of the
      function.
      
      Link: http://lkml.kernel.org/r/20120807194100.130477900@goodmis.orgAcked-by: NIngo Molnar <mingo@kernel.org>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d57c5d51
  5. 10 8月, 2012 2 次提交
    • J
      perf: Add ability to attach user stack dump to sample · c5ebcedb
      Jiri Olsa 提交于
      Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
      of the user level stack on sample. The size of the dump is specified by
      sample_stack_user value.
      
      Being able to dump parts of the user stack, starting from the stack
      pointer, will be useful to make a post mortem dwarf CFI based stack
      unwinding.
      
      Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
      architecture provides user stack dump on perf event samples.  This needs
      access to the user stack pointer which is not unified across
      architectures. Enabling this for x86 architecture.
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Original-patch-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c5ebcedb
    • J
      perf: Unified API to record selective sets of arch registers · c5e63197
      Jiri Olsa 提交于
      This brings a new API to help the selective dump of registers on event
      sampling, and its implementation for x86 arch.
      
      Added HAVE_PERF_REGS config option to determine if the architecture
      provides perf registers ABI.
      
      The information about desired registers will be passed in u64 mask.
      It's up to the architecture to map the registers into the mask bits.
      
      For the x86 arch implementation, both 32 and 64 bit registers bits are
      defined within single enum to ensure 64 bit system can provide register
      dump for compat task if needed in the future.
      Original-patch-by: NFrederic Weisbecker <fweisbec@gmail.com>
      [ Added missing linux/errno.h include ]
      Signed-off-by: NJiri Olsa <jolsa@redhat.com>
      Cc: "Frank Ch. Eigler" <fche@redhat.com>
      Cc: Arun Sharma <asharma@fb.com>
      Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Ulrich Drepper <drepper@gmail.com>
      Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      c5e63197
  6. 31 7月, 2012 2 次提交
  7. 09 7月, 2012 1 次提交
  8. 02 6月, 2012 1 次提交
  9. 27 5月, 2012 2 次提交
    • L
      x86: use the new generic strnlen_user() function · 5723aa99
      Linus Torvalds 提交于
      This throws away the old x86-specific functions in favor of the generic
      optimized version.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5723aa99
    • L
      x86: use generic strncpy_from_user routine · 4ae73f2d
      Linus Torvalds 提交于
      The generic strncpy_from_user() is not really optimal, since it is
      designed to work on both little-endian and big-endian.  And on
      little-endian you can simplify much of the logic to find the first zero
      byte, since little-endian arithmetic doesn't have to worry about the
      carry bit propagating into earlier bytes (only later bytes, which we
      don't care about).
      
      But I have patches to make the generic routines use the architecture-
      specific <asm/word-at-a-time.h> infrastructure, so that we can regain
      the little-endian optimizations.  But before we do that, switch over to
      the generic routines to make the patches each do just one well-defined
      thing.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4ae73f2d
  10. 22 5月, 2012 2 次提交
  11. 21 5月, 2012 2 次提交
  12. 18 5月, 2012 1 次提交
    • P
      MCA: delete all remaining traces of microchannel bus support. · bb8187d3
      Paul Gortmaker 提交于
      Hardware with MCA bus is limited to 386 and 486 class machines
      that are now 20+ years old and typically with less than 32MB
      of memory.  A quick search on the internet, and you see that
      even the MCA hobbyist/enthusiast community has lost interest
      in the early 2000 era and never really even moved ahead from
      the 2.4 kernels to the 2.6 series.
      
      This deletes anything remaining related to CONFIG_MCA from core
      kernel code and from the x86 architecture.  There is no point in
      carrying this any further into the future.
      
      One complication to watch for is inadvertently scooping up
      stuff relating to machine check, since there is overlap in
      the TLA name space (e.g. arch/x86/boot/mca.c).
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: x86@kernel.org
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Acked-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      bb8187d3
  13. 14 5月, 2012 1 次提交
  14. 08 5月, 2012 1 次提交
  15. 05 5月, 2012 2 次提交
  16. 04 5月, 2012 1 次提交
    • L
      vfs: make word-at-a-time accesses handle a non-existing page · e419b4cc
      Linus Torvalds 提交于
      It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
      can have holes in the kernel address space: it seems to happen easily
      with Xen, and it looks like the AMD gart64 code will also punch holes
      dynamically.
      
      Actually hitting that case is still very unlikely, so just do the
      access, and take an exception and fix it up for the very unlikely case
      of it being a page-crosser with no next page.
      
      And hey, this abstraction might even help other architectures that have
      other issues with unaligned word accesses than the possible missing next
      page.  IOW, this could do the byte order magic too.
      
      Peter Anvin fixed a thinko in the shifting for the exception case.
      Reported-and-tested-by: NJana Saout <jana@saout.de>
      Cc:  Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e419b4cc
  17. 28 4月, 2012 1 次提交
    • S
      ftrace/x86: Remove the complex ftrace NMI handling code · 4a6d70c9
      Steven Rostedt 提交于
      As ftrace function tracing would require modifying code that could
      be executed in NMI context, which is not stopped with stop_machine(),
      ftrace had to do a complex algorithm with various stages of setup
      and memory barriers to make it work.
      
      With the new breakpoint method, this is no longer required. The changes
      to the code can be done without any problem in NMI context, as well as
      without stop machine altogether. Remove the complex code as it is
      no longer needed.
      
      Also, a lot of the notrace annotations could be removed from the
      NMI code as it is now safe to trace them. With the exception of
      do_nmi itself, which does some special work to handle running in
      the debug stack. The breakpoint method can cause NMIs to double
      nest the debug stack if it's not setup properly, and that is done
      in do_nmi(), thus that function must not be traced.
      
      (Note the arch sh may want to do the same)
      
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      4a6d70c9
  18. 26 4月, 2012 1 次提交
  19. 25 4月, 2012 1 次提交
  20. 21 4月, 2012 1 次提交
  21. 20 4月, 2012 1 次提交
  22. 14 4月, 2012 2 次提交
  23. 13 4月, 2012 3 次提交
  24. 21 3月, 2012 1 次提交
  25. 16 3月, 2012 1 次提交
    • C
      [PATCH v3] ipc: provide generic compat versions of IPC syscalls · 48b25c43
      Chris Metcalf 提交于
      When using the "compat" APIs, architectures will generally want to
      be able to make direct syscalls to msgsnd(), shmctl(), etc., and
      in the kernel we would want them to be handled directly by
      compat_sys_xxx() functions, as is true for other compat syscalls.
      
      However, for historical reasons, several of the existing compat IPC
      syscalls do not do this.  semctl() expects a pointer to the fourth
      argument, instead of the fourth argument itself.  msgsnd(), msgrcv()
      and shmat() expect arguments in different order.
      
      This change adds an ARCH_WANT_OLD_COMPAT_IPC config option that can be
      set to preserve this behavior for ports that use it (x86, sparc, powerpc,
      s390, and mips).  No actual semantics are changed for those architectures,
      and there is only a minimal amount of code refactoring in ipc/compat.c.
      
      Newer architectures like tile (and perhaps future architectures such
      as arm64 and unicore64) should not select this option, and thus can
      avoid having any IPC-specific code at all in their architecture-specific
      compat layer.  In the same vein, if this option is not selected, IPC_64
      mode is assumed, since that's what the <asm-generic> headers expect.
      
      The workaround code in "tile" for msgsnd() and msgrcv() is removed
      with this change; it also fixes the bug that shmat() and semctl() were
      not being properly handled.
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NChris Metcalf <cmetcalf@tilera.com>
      48b25c43
  26. 09 3月, 2012 1 次提交
  27. 06 3月, 2012 1 次提交
  28. 28 2月, 2012 1 次提交
  29. 24 2月, 2012 1 次提交
  30. 21 2月, 2012 1 次提交
  31. 17 2月, 2012 1 次提交
    • S
      uprobes, mm, x86: Add the ability to install and remove uprobes breakpoints · 2b144498
      Srikar Dronamraju 提交于
      Add uprobes support to the core kernel, with x86 support.
      
      This commit adds the kernel facilities, the actual uprobes
      user-space ABI and perf probe support comes in later commits.
      
      General design:
      
      Uprobes are maintained in an rb-tree indexed by inode and offset
      (the offset here is from the start of the mapping). For a unique
      (inode, offset) tuple, there can be at most one uprobe in the
      rb-tree.
      
      Since the (inode, offset) tuple identifies a unique uprobe, more
      than one user may be interested in the same uprobe. This provides
      the ability to connect multiple 'consumers' to the same uprobe.
      
      Each consumer defines a handler and a filter (optional). The
      'handler' is run every time the uprobe is hit, if it matches the
      'filter' criteria.
      
      The first consumer of a uprobe causes the breakpoint to be
      inserted at the specified address and subsequent consumers are
      appended to this list.  On subsequent probes, the consumer gets
      appended to the existing list of consumers. The breakpoint is
      removed when the last consumer unregisters. For all other
      unregisterations, the consumer is removed from the list of
      consumers.
      
      Given a inode, we get a list of the mms that have mapped the
      inode. Do the actual registration if mm maps the page where a
      probe needs to be inserted/removed.
      
      We use a temporary list to walk through the vmas that map the
      inode.
      
      - The number of maps that map the inode, is not known before we
        walk the rmap and keeps changing.
      - extending vm_area_struct wasn't recommended, it's a
        size-critical data structure.
      - There can be more than one maps of the inode in the same mm.
      
      We add callbacks to the mmap methods to keep an eye on text vmas
      that are of interest to uprobes.  When a vma of interest is mapped,
      we insert the breakpoint at the right address.
      
      Uprobe works by replacing the instruction at the address defined
      by (inode, offset) with the arch specific breakpoint
      instruction. We save a copy of the original instruction at the
      uprobed address.
      
      This is needed for:
      
       a. executing the instruction out-of-line (xol).
       b. instruction analysis for any subsequent fixups.
       c. restoring the instruction back when the uprobe is unregistered.
      
      We insert or delete a breakpoint instruction, and this
      breakpoint instruction is assumed to be the smallest instruction
      available on the platform. For fixed size instruction platforms
      this is trivially true, for variable size instruction platforms
      the breakpoint instruction is typically the smallest (often a
      single byte).
      
      Writing the instruction is done by COWing the page and changing
      the instruction during the copy, this even though most platforms
      allow atomic writes of the breakpoint instruction. This also
      mirrors the behaviour of a ptrace() memory write to a PRIVATE
      file map.
      
      The core worker is derived from KSM's replace_page() logic.
      
      In essence, similar to KSM:
      
       a. allocate a new page and copy over contents of the page that
          has the uprobed vaddr
       b. modify the copy and insert the breakpoint at the required
          address
       c. switch the original page with the copy containing the
          breakpoint
       d. flush page tables.
      
      replace_page() is being replicated here because of some minor
      changes in the type of pages and also because Hugh Dickins had
      plans to improve replace_page() for KSM specific work.
      
      Instruction analysis on x86 is based on instruction decoder and
      determines if an instruction can be probed and determines the
      necessary fixups after singlestep.  Instruction analysis is done
      at probe insertion time so that we avoid having to repeat the
      same analysis every time a probe is hit.
      
      A lot of code here is due to the improvement/suggestions/inputs
      from Peter Zijlstra.
      
      Changelog:
      
      (v10):
       - Add code to clear REX.B prefix as suggested by Denys Vlasenko
         and Masami Hiramatsu.
      
      (v9):
       - Use insn_offset_modrm as suggested by Masami Hiramatsu.
      
      (v7):
      
       Handle comments from Peter Zijlstra:
      
       - Dont take reference to inode. (expect inode to uprobe_register to be sane).
       - Use PTR_ERR to set the return value.
       - No need to take reference to inode.
       - use PTR_ERR to return error value.
       - register and uprobe_unregister share code.
      
      (v5):
      
       - Modified del_consumer as per comments from Peter.
       - Drop reference to inode before dropping reference to uprobe.
       - Use i_size_read(inode) instead of inode->i_size.
       - Ensure uprobe->consumers is NULL, before __uprobe_unregister() is called.
       - Includes errno.h as recommended by Stephen Rothwell to fix a build issue
         on sparc defconfig
       - Remove restrictions while unregistering.
       - Earlier code leaked inode references under some conditions while
         registering/unregistering.
       - Continue the vma-rmap walk even if the intermediate vma doesnt
         meet the requirements.
       - Validate the vma found by find_vma before inserting/removing the
         breakpoint
       - Call del_consumer under mutex_lock.
       - Use hash locks.
       - Handle mremap.
       - Introduce find_least_offset_node() instead of close match logic in
         find_uprobe
       - Uprobes no more depends on MM_OWNER; No reference to task_structs
         while inserting/removing a probe.
       - Uses read_mapping_page instead of grab_cache_page so that the pages
         have valid content.
       - pass NULL to get_user_pages for the task parameter.
       - call SetPageUptodate on the new page allocated in write_opcode.
       - fix leaking a reference to the new page under certain conditions.
       - Include Instruction Decoder if Uprobes gets defined.
       - Remove const attributes for instruction prefix arrays.
       - Uses mm_context to know if the application is 32 bit.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Also-written-by: NJim Keniston <jkenisto@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Denys Vlasenko <vda.linux@googlemail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linux-mm <linux-mm@kvack.org>
      Link: http://lkml.kernel.org/r/20120209092642.GE16600@linux.vnet.ibm.com
      [ Made various small edits to the commit log ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2b144498