1. 04 May 2017: 1 commit
  2. 01 May 2017: 1 commit
    • ring-buffer: Return reader page back into existing ring buffer · 73a757e6
      Committed by Steven Rostedt (VMware)
      When reading the ring buffer for consuming, it is optimized for splice:
      a page is taken out of the ring buffer (zero copy) and sent to the
      reading consumer. When the reader is finished with the page, it calls
      ring_buffer_free_read_page(), which simply frees the page. The next time the
      reader needs a page from the ring buffer, it must call
      ring_buffer_alloc_read_page(), which allocates and initializes a reader page
      that can be swapped into the ring buffer in exchange for a newly filled
      page for the reader.
      
      The problem is that there is no reason to actually free the page when it is
      passed back to the ring buffer. It can hold on to it and reuse it for the
      next iteration. This completely removes the interaction with the page_alloc
      mechanism.
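      
      A minimal sketch of the consuming-read pattern this optimizes (signatures
      approximate include/linux/ring_buffer.h from this series; consume_cpu()
      is a hypothetical helper, not kernel code):
      
       static void consume_cpu(struct ring_buffer *buffer, int cpu)
       {
       	/* After this change, this can return a page the ring buffer
       	 * cached from a previous read instead of hitting page_alloc. */
       	void *page = ring_buffer_alloc_read_page(buffer, cpu);
       
       	if (!page)
       		return;
       
       	/* Swap the reader page into the ring buffer in exchange for
       	 * a filled page (zero copy). */
       	if (ring_buffer_read_page(buffer, &page, PAGE_SIZE, cpu, 0) >= 0) {
       		/* ... splice the filled page to the consumer ... */
       	}
       
       	/* Before: freed the page outright. Now: the ring buffer holds
       	 * on to it for the next iteration. */
       	ring_buffer_free_read_page(buffer, cpu, page);
       }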
      
      Using the trace-cmd utility to record all events (causing trace-cmd to
      read lots of pages from the ring buffer, and to call
      ring_buffer_alloc/free_read_page() several times), and also assigning a
      stack trace trigger to the mm_page_alloc event, we can see how many times
      ring_buffer_alloc_read_page() needed to allocate a page for the ring
      buffer.
      
      Before this change:
      
        # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
        # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
        9968
      
      After this change:
      
        # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
        # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
        4
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      73a757e6
  3. 21 Apr 2017: 10 commits
    • tracing/ftrace: Allow for the traceonoff probe be unique to instances · 2290f2c5
      Committed by Steven Rostedt (VMware)
      Have the traceon/off function probe triggers affect only the instance they
      are set in. This required making the trace_on/off functions accessible to
      other files in the tracing directory.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2290f2c5
    • tracing/ftrace: Enable snapshot function trigger to work with instances · cab50379
      Committed by Steven Rostedt (VMware)
      Modify the snapshot probe trigger to work with instances. This way the
      snapshot function trigger will only affect the instance whose
      set_ftrace_filter file it was added through.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      cab50379
    • tracing/ftrace: Add a better way to pass data via the probe functions · 6e444319
      Committed by Steven Rostedt (VMware)
      With the redesign of the registration and execution of the function probes
      (triggers), data can now be passed from the setup of the probe to the probe
      callers that are specific to the trace_array it is on. Although all probes
      still only affect the top-level trace array, this change will allow
      instances to have their own probes, separated from other instances and from
      the top array.
      
      That is, something like the stacktrace probe can be set to trace only in an
      instance and not in the top-level trace array. This isn't implemented yet,
      but this change lays the groundwork for it.
      
      When a probe is registered (someone writes the probe format into
      set_ftrace_filter), register_ftrace_function_probe() is called, passing in
      init_data that will be used to initialize the probe. Then for every matching
      function, register_ftrace_function_probe() will call the probe_ops->init()
      function with the init data that was passed to it, as well as the address of
      a placeholder that is associated with the probe and the instance. On the
      first occurrence the placeholder pointer is NULL, and the init() function
      initializes it. If other probes are added, or more functions are attached to
      the probe, the placeholder is passed to the init() function already holding
      the data it was initialized with the last time.
      
      This placeholder is then passed to each of the other probe_ops functions,
      where it can be used in the function callback. When the probe_ops free()
      function is called, it is called either with the ip of the function that is
      being removed from the probe, or with zero, indicating that there are no
      more functions attached to the probe and the placeholder is about to be
      freed. This gives the probe_ops a way to free the data it assigned to the
      placeholder if it was allocated during the first init call.
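      
      As a hedged sketch of that init()/free() contract (callback signatures
      approximate this series; my_probe_data is a hypothetical state struct):
      
       static int
       my_probe_init(struct ftrace_probe_ops *ops, struct trace_array *tr,
       	      unsigned long ip, void *init_data, void **data)
       {
       	struct my_probe_data *d = *data;
       
       	/* First matching function: the placeholder is still NULL. */
       	if (!d) {
       		d = kzalloc(sizeof(*d), GFP_KERNEL);
       		if (!d)
       			return -ENOMEM;
       		*data = d;	/* seen by later init/func/free calls */
       	}
       	return 0;
       }
       
       static void
       my_probe_free(struct ftrace_probe_ops *ops, struct trace_array *tr,
       	      unsigned long ip, void *data)
       {
       	/* ip == 0: last function removed, placeholder going away. */
       	if (!ip)
       		kfree(data);
       }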
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      6e444319
    • ftrace: Dynamically create the probe ftrace_ops for the trace_array · 7b60f3d8
      Committed by Steven Rostedt (VMware)
      In order to eventually have each trace_array instance have its own unique
      set of function probes (triggers), the trace array needs to hold the ops and
      the filters for the probes.
      
      This is the first step to accomplish that. Instead of having the private
      data of the probe ops point to the trace_array, create a separate list that
      the trace_array holds. There's only one private_data for a probe, but we
      need one per trace_array. The probe ftrace_ops will be dynamically created
      for each instance, instead of being static.
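      
      Roughly the shape this moves toward, as a sketch (field names approximate
      kernel/trace/ftrace.c of this series):
      
       struct ftrace_func_probe {
       	struct ftrace_probe_ops	*probe_ops;
       	struct ftrace_ops	ops;	/* dynamically created, one per instance */
       	struct trace_array	*tr;
       	struct list_head	list;	/* linked into the trace_array */
       	void			*data;
       	int			ref;
       };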
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      7b60f3d8
    • tracing: Pass the trace_array into ftrace_probe_ops functions · b5f081b5
      Committed by Steven Rostedt (VMware)
      Pass the trace_array associated with an ftrace_probe_ops into the probe_ops
      func(), init() and free() functions. The trace_array is the descriptor that
      describes a tracing instance. This will help create the infrastructure that
      will allow having function probes unique to tracing instances.
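      
      The resulting callbacks look roughly like this (a sketch of the
      ftrace_probe_ops signatures as of this series):
      
       struct ftrace_probe_ops {
       	void	(*func)(unsigned long ip, unsigned long parent_ip,
       			struct trace_array *tr,
       			struct ftrace_probe_ops *ops, void *data);
       	int	(*init)(struct ftrace_probe_ops *ops,
       			struct trace_array *tr, unsigned long ip,
       			void *init_data, void **data);
       	void	(*free)(struct ftrace_probe_ops *ops,
       			struct trace_array *tr, unsigned long ip,
       			void *data);
       	int	(*print)(struct seq_file *m, unsigned long ip,
       			 struct ftrace_probe_ops *ops, void *data);
       };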
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b5f081b5
    • tracing: Have the trace_array hold the list of registered func probes · 04ec7bb6
      Committed by Steven Rostedt (VMware)
      Add a linked list to the trace_array to hold the func probes that are
      registered. Currently, all function probes are the same for all instances,
      as before; that is, only the top-level trace_array holds the function
      probes. But this lays the groundwork for having function probes attached to
      individual instances, with the event trigger affecting only events in the
      given instance. That work is still to be done.
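      
      Schematically (a sketch; names approximate the patch):
      
       struct trace_array {
       	/* ... existing fields ... */
       	struct list_head	func_probes;	/* registered func probes */
       };
       
       /* at registration time, roughly: */
       list_add(&probe->list, &tr->func_probes);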
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      04ec7bb6
    • ftrace: Have unregister_ftrace_function_probe_func() return a value · d3d532d7
      Committed by Steven Rostedt (VMware)
      Currently unregister_ftrace_function_probe_func() is a void function. It
      gives no feedback on whether an error occurred, or whether no matching item
      was found and nothing was removed.
      
      Change it to return a status, reporting success if it removed something.
      Also update the callers to pass that feedback on to the user.
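      
      Callers can then do something like this (a sketch, not the exact call
      site; the argument list here is approximate):
      
       ret = unregister_ftrace_function_probe_func(glob, tr, ops);
       if (ret < 0)
       	return ret;	/* e.g. nothing matched, nothing was removed */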
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      d3d532d7
    • ftrace: Remove data field from ftrace_func_probe structure · 1a48df00
      Committed by Steven Rostedt (VMware)
      No users of the function probes use the data field anymore. Remove it, and
      change the init function to take a void *data parameter instead of a
      void **data, because init will now just receive the data that the
      registering function was given, and there is no state to carry after it is
      called.
      
      The other ftrace_probe_ops functions still take the data parameter, but for
      now it will only be passed NULL. It stays as a parameter so that data can be
      passed to these functions in the future.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1a48df00
    • tracing: Have the snapshot trigger use the mapping helper functions · 1a93f8bd
      Committed by Steven Rostedt (VMware)
      As the data pointer for individual ips will soon be removed and no longer
      passed to the function probe callback handlers, convert the snapshot
      trigger counter over to the new ftrace_func_mapper helper functions.
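      
      A sketch of what the counter handling becomes with the mapper helpers
      (names follow the ftrace_func_mapper API; details approximate):
      
       struct ftrace_func_mapper *mapper = data;
       long *count;
       
       /* The per-ip counter now lives in the mapper rather than in a
        * per-ip data pointer. */
       count = (long *)ftrace_func_mapper_find_ip(mapper, ip);
       if (count) {
       	if (!*count)
       		return;		/* counter exhausted, do nothing */
       	if (*count != -1)	/* -1 means no limit */
       		(*count)--;
       }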
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1a93f8bd
    • ftrace: Pass probe ops to probe function · bca6c8d0
      Committed by Steven Rostedt (VMware)
      In preparation for cleaning up the probe function registration code, the
      "data" parameter will eventually be removed from the probe->func() call.
      Instead, the probe will receive its own "ops" pointer, with which it can
      set up its own data that it needs to map.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bca6c8d0
  4. 20 Apr 2017: 1 commit
    • tracing: Allocate the snapshot buffer before enabling probe · df62db5b
      Committed by Steven Rostedt (VMware)
      Currently the snapshot trigger enables the probe and then allocates the
      snapshot buffer. If the probe fires before the allocation, it can cause the
      snapshot to fail and turn tracing off. It's best to allocate the snapshot
      buffer first, and then enable the trigger. If something goes wrong in
      enabling the trigger, the snapshot buffer is still allocated, but it can
      also be freed by the user by writing zero into the snapshot buffer file.
      
      Also add a check of the return status of alloc_snapshot().
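      
      The ordering fix, sketched (function names and argument lists approximate
      the snapshot trigger code of this era):
      
       /* allocate first ... */
       ret = alloc_snapshot(tr);
       if (ret < 0)
       	return ret;
       
       /* ... then arm the trigger; an early hit can no longer find the
        * snapshot buffer missing and turn tracing off. */
       ret = register_ftrace_function_probe(glob, ops, count);
       return ret < 0 ? ret : 0;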
      
      Cc: stable@vger.kernel.org
      Fixes: 77fd5c15 ("tracing: Add snapshot trigger to function probes")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      df62db5b
  5. 18 Apr 2017: 2 commits
  6. 25 Mar 2017: 4 commits
    • tracing: Move trace_handle_return() out of line · af0009fc
      Committed by Steven Rostedt (VMware)
      Currently trace_handle_return() looks like this:
      
       static inline enum print_line_t trace_handle_return(struct trace_seq *s)
       {
              return trace_seq_has_overflowed(s) ?
                      TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
       }
      
      Where trace_seq_has_overflowed(s) is:
      
       static inline bool trace_seq_has_overflowed(struct trace_seq *s)
       {
      	return s->full || seq_buf_has_overflowed(&s->seq);
       }
      
      And seq_buf_has_overflowed(&s->seq) is:
      
       static inline bool
       seq_buf_has_overflowed(struct seq_buf *s)
       {
      	return s->len > s->size;
       }
      
      Making trace_handle_return() into:
      
       return (s->full || (s->seq->len > s->seq->size)) ?
                 TRACE_TYPE_PARTIAL_LINE :
                 TRACE_TYPE_HANDLED;
      
      One would think this is not an issue to keep as an inline. But because it is
      used in the TRACE_EVENT() macro, it is expanded for every tracepoint in the
      system. Take a look at a single tracepoint, x86_irq_vector (the first one I
      randomly chose). As trace_handle_return() is used in the
      trace_raw_output_##call() function generated by the TRACE_EVENT() macro, we
      disassemble trace_raw_output_x86_irq_vector and do a diff:
      
      - is the original
      + is the out-of-line code
      
      I removed identical lines that were different just due to different
      addresses.
      
      --- /tmp/irq-vec-orig	2017-03-16 09:12:48.569384851 -0400
      +++ /tmp/irq-vec-ool	2017-03-16 09:13:39.378153385 -0400
      @@ -6,27 +6,23 @@
              53                      push   %rbx
              48 89 fb                mov    %rdi,%rbx
              4c 8b a7 c0 20 00 00    mov    0x20c0(%rdi),%r12
              e8 f7 72 13 00          callq  ffffffff81155c80 <trace_raw_output_prep>
              83 f8 01                cmp    $0x1,%eax
              74 05                   je     ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
              5b                      pop    %rbx
              41 5c                   pop    %r12
              5d                      pop    %rbp
              c3                      retq
              41 8b 54 24 08          mov    0x8(%r12),%edx
      -       48 8d bb 98 10 00 00    lea    0x1098(%rbx),%rdi
      +       48 81 c3 98 10 00 00    add    $0x1098,%rbx
      -       48 c7 c6 7b 8a a0 81    mov    $0xffffffff81a08a7b,%rsi
      +       48 c7 c6 ab 8a a0 81    mov    $0xffffffff81a08aab,%rsi
      -       e8 c5 85 13 00          callq  ffffffff81156f70 <trace_seq_printf>
      
       === here's the start of the main difference ===
      
      +       48 89 df                mov    %rbx,%rdi
      +       e8 62 7e 13 00          callq  ffffffff81156810 <trace_seq_printf>
      -       8b 93 b8 20 00 00       mov    0x20b8(%rbx),%edx
      -       31 c0                   xor    %eax,%eax
      -       85 d2                   test   %edx,%edx
      -       75 11                   jne    ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
      -       48 8b 83 a8 20 00 00    mov    0x20a8(%rbx),%rax
      -       48 39 83 a0 20 00 00    cmp    %rax,0x20a0(%rbx)
      -       0f 93 c0                setae  %al
      +       48 89 df                mov    %rbx,%rdi
      +       e8 4a c5 12 00          callq  ffffffff8114af00 <trace_handle_return>
              5b                      pop    %rbx
      -       0f b6 c0                movzbl %al,%eax
      
       === end ===
      
              41 5c                   pop    %r12
              5d                      pop    %rbp
              c3                      retq
      
      If you notice, the original has 22 bytes more text than the out-of-line
      version. As this happens for every TRACE_EVENT() defined in the system, the
      savings can become quite large.
      
         text	   data	    bss	    dec	    hex	filename
      8690305	5450490	1298432	15439227	 eb957b	vmlinux-orig
      8681725	5450490	1298432	15430647	 eb73f7	vmlinux-handle
      
      This change has a total of 8580 bytes in savings.
      
       $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
      324
      
      That's 324 tracepoints. But this does not include modules (which contain
      many more tracepoints). For an allyesconfig build:
      
       $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
      1401
      
      That's 1401 tracepoints giving us:
      
         text    data     bss     dec     hex filename
      137920629       140221067       53264384        331406080       13c0db00 vmlinux-allyes-orig
      137827709       140221067       53264384        331313160       13bf7008 vmlinux-allyes-handle
      
      92920 bytes in savings!!!
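      
      The change itself is small: the inline body simply moves out of line,
      e.g. into trace_output.c (a sketch of the uninlined version):
      
       /* one copy of the overflow test instead of one per tracepoint */
       enum print_line_t trace_handle_return(struct trace_seq *s)
       {
       	return trace_seq_has_overflowed(s) ?
       		TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
       }
       EXPORT_SYMBOL_GPL(trace_handle_return);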
      
      Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
      Reported-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      af0009fc
    • ftrace: Have function tracing start in early boot up · dbeafd0d
      Committed by Steven Rostedt (VMware)
      Register the function tracer right after the tracing buffers are initialized
      in early boot up. This will allow function tracing to begin early if it is
      enabled via the kernel command line.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      dbeafd0d
    • tracing: Postpone tracer start-up tests till the system is more robust · 9afecfbb
      Committed by Steven Rostedt (VMware)
      As tracing can now be enabled very early in boot up, even before some
      critical system services (like scheduling) are available, do not run the
      tracer selftests until after early_initcall() is performed. If a tracer is
      registered before that time, it is saved off in a list and its test is run
      when the system is able to handle more diverse functions.
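      
      A sketch of the deferral mechanism (structure and helper names
      approximate the patch):
      
       struct trace_selftests {
       	struct list_head	list;
       	struct tracer		*type;
       };
       
       static LIST_HEAD(postponed_selftests);
       
       /* called from register_tracer() while it is too early to test */
       static int save_selftest(struct tracer *type)
       {
       	struct trace_selftests *selftest;
       
       	selftest = kmalloc(sizeof(*selftest), GFP_KERNEL);
       	if (!selftest)
       		return -ENOMEM;
       
       	selftest->type = type;
       	list_add(&selftest->list, &postponed_selftests);
       	return 0;
       }
       
       /* later, an initcall walks postponed_selftests and runs each test */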
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      9afecfbb
    • tracing: Split tracing initialization into two for early initialization · e725c731
      Committed by Steven Rostedt (VMware)
      Create an early_trace_init() function that will initialize the buffers and
      allow for earlier use of trace_printk(). This will also allow future work to
      have function tracing start earlier at boot up.
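      
      In init/main.c terms the split looks roughly like this (a sketch):
      
       /* start_kernel(), abridged: */
       early_trace_init();	/* buffers allocated; trace_printk() usable */
       /* ... scheduler and other core init ... */
       trace_init();		/* the remainder of tracing initialization */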
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      e725c731
  7. 04 Mar 2017: 1 commit
  8. 01 Mar 2017: 1 commit
  9. 17 Feb 2017: 1 commit
  10. 03 Feb 2017: 1 commit
  11. 01 Feb 2017: 1 commit
    • fs: Better permission checking for submounts · 93faccbb
      Committed by Eric W. Biederman
      To support unprivileged users mounting filesystems, two permission
      checks have to be performed: a test to see if the user is allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems to any user who
      happens to stumble across their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works. It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately, using override_creds messes up the filesystem's
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take
      appropriate action.
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds, as that
      has proven problematic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag, so
      that we know userspace will never be able to specify it.
      
      autofs4 is modified to stop using current_real_cred, which was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
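      
      A sketch of the new helper (approximate; see fs/namespace.c for the
      exact guard conditions):
      
       struct vfsmount *
       vfs_submount(const struct dentry *mountpoint,
       	     struct file_system_type *type,
       	     const char *name, void *data)
       {
       	/* Unprivileged mount namespaces don't get submounts yet. */
       	if (mountpoint->d_sb->s_user_ns != &init_user_ns)
       		return ERR_PTR(-EPERM);
       
       	/* MS_SUBMOUNT lets sget and friends skip the permission
       	 * checks a user-requested mount would have to pass. */
       	return vfs_kern_mount(type, MS_KERNMOUNT | MS_SUBMOUNT, name, data);
       }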
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      93faccbb
  12. 27 Dec 2016: 1 commit
  13. 25 Dec 2016: 1 commit
  14. 13 Dec 2016: 1 commit
  15. 09 Dec 2016: 1 commit
    • tracing: Replace kmap with copy_from_user() in trace_marker writing · 656c7f0d
      Committed by Steven Rostedt (Red Hat)
      Instead of using get_user_pages_fast() and kmap_atomic() when writing
      to the trace_marker file, just allocate enough space on the ring buffer
      directly, and write into it via copy_from_user().
      
      Writing into the trace_marker file used to allocate a temporary buffer
      to perform the copy_from_user(), as we didn't want to write into the
      ring buffer if the copy failed. But as a trace_marker write is supposed
      to be extremely fast, and allocating memory causes other tracepoints to
      trigger, Peter Zijlstra suggested using get_user_pages_fast() and
      kmap_atomic() to keep the user space pages in memory and read them
      directly. But Henrik Austad had issues with this because it required
      taking the mm->mmap_sem and caused long delays with the write.
      
      Instead, just allocate the space in the ring buffer and use
      copy_from_user() directly. If it faults, return -EFAULT and write
      "<faulted>" into the ring buffer.
      
      Link: http://lkml.kernel.org/r/20161208124018.72dd0f86@gandalf.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Henrik Austad <henrik@austad.us>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Updates: d696b58c "tracing: Do not allocate buffer for trace_marker"
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      656c7f0d
  16. 02 Dec 2016: 1 commit
  17. 30 Nov 2016: 1 commit
  18. 24 Nov 2016: 3 commits
  19. 23 Nov 2016: 1 commit
  20. 16 Nov 2016: 1 commit
  21. 15 Nov 2016: 1 commit
  22. 26 Sep 2016: 1 commit
  23. 25 Sep 2016: 1 commit
  24. 12 Sep 2016: 1 commit
  25. 03 Sep 2016: 1 commit
    • tracing: Added hardware latency tracer · e7c15cd8
      Committed by Steven Rostedt (Red Hat)
      The hardware latency tracer has been in the PREEMPT_RT patch for some time.
      It is used to detect possible SMIs or any other hardware interruptions that
      the kernel is unaware of. Note that NMIs may also be detected, and that can
      be good to know as well.
      
      The logic is pretty simple. It creates a thread that spins on a
      single CPU for a specified amount of time (width) within a periodic window
      (window). These numbers may be adjusted by their corresponding names in
      
         /sys/kernel/tracing/hwlat_detector/
      
      The defaults are window = 1000000 us (1 second)
                       width  =  500000 us (1/2 second)
      
      The loop consists of:
      
      	t1 = trace_clock_local();
      	t2 = trace_clock_local();
      
      Where trace_clock_local() is a variant of sched_clock().
      
      The difference t2 - t1 is recorded as the "inner" timestamp, and the
      difference t1 - prev_t2 is recorded as the "outer" timestamp. If either of
      these differences is greater than the time denoted in
      /sys/kernel/tracing/tracing_thresh, then the event is recorded.
      
      When this tracer is started, and tracing_thresh is zero, it changes to the
      default threshold of 10 us.
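      
      The sampling loop, sketched (compare kernel/trace/trace_hwlat.c; the
      real code works through small time_* wrapper macros, and record_sample()
      here is a hypothetical helper):
      
       u64 start, t1, t2, last_t2 = 0, total = 0;
       
       start = trace_clock_local();
       do {
       	t1 = trace_clock_local();
       	t2 = trace_clock_local();
       
       	/* "inner": time swallowed between two back-to-back reads */
       	if (t2 - t1 > thresh)
       		record_sample();
       
       	/* "outer": time lost between loop iterations */
       	if (last_t2 && t1 - last_t2 > thresh)
       		record_sample();
       
       	last_t2 = t2;
       	total = t2 - start;
       } while (total <= sample_width);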
      
      The hwlat tracer in the PREEMPT_RT patch was originally written by
      Jon Masters. I have modified it quite a bit and turned it into a
      tracer.
      Based-on-code-by: Jon Masters <jcm@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      e7c15cd8