1. 21 Apr 2017, 1 commit
  2. 18 Apr 2017, 1 commit
  3. 25 Mar 2017, 4 commits
    • tracing: Move trace_handle_return() out of line · af0009fc
      Committed by Steven Rostedt (VMware)
      Currently trace_handle_return() looks like this:
      
       static inline enum print_line_t trace_handle_return(struct trace_seq *s)
       {
              return trace_seq_has_overflowed(s) ?
                      TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
       }
      
      Where trace_seq_has_overflowed(s) is:
      
       static inline bool trace_seq_has_overflowed(struct trace_seq *s)
       {
      	return s->full || seq_buf_has_overflowed(&s->seq);
       }
      
      And seq_buf_has_overflowed(&s->seq) is:
      
       static inline bool
       seq_buf_has_overflowed(struct seq_buf *s)
       {
      	return s->len > s->size;
       }
      
      Making trace_handle_return() into:
      
        return (s->full || (s->seq.len > s->seq.size)) ?
                 TRACE_TYPE_PARTIAL_LINE :
                 TRACE_TYPE_HANDLED;
      
      One would think this is not an issue to keep as an inline. But because this
      is used in the TRACE_EVENT() macro, it is expanded for every tracepoint in
      the system. Take a look at a single tracepoint, x86_irq_vector (the first
      one I chose at random). As trace_handle_return() is used in the
      TRACE_EVENT() macro's trace_raw_output_##call(), we disassemble
      trace_raw_output_x86_irq_vector and do a diff:
      
      - is the original
      + is the out-of-line code
      
      I removed lines that were identical apart from differing addresses.
      
      --- /tmp/irq-vec-orig	2017-03-16 09:12:48.569384851 -0400
      +++ /tmp/irq-vec-ool	2017-03-16 09:13:39.378153385 -0400
      @@ -6,27 +6,23 @@
              53                      push   %rbx
              48 89 fb                mov    %rdi,%rbx
              4c 8b a7 c0 20 00 00    mov    0x20c0(%rdi),%r12
              e8 f7 72 13 00          callq  ffffffff81155c80 <trace_raw_output_prep>
              83 f8 01                cmp    $0x1,%eax
              74 05                   je     ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23>
              5b                      pop    %rbx
              41 5c                   pop    %r12
              5d                      pop    %rbp
              c3                      retq
              41 8b 54 24 08          mov    0x8(%r12),%edx
      -       48 8d bb 98 10 00 00    lea    0x1098(%rbx),%rdi
      +       48 81 c3 98 10 00 00    add    $0x1098,%rbx
      -       48 c7 c6 7b 8a a0 81    mov    $0xffffffff81a08a7b,%rsi
      +       48 c7 c6 ab 8a a0 81    mov    $0xffffffff81a08aab,%rsi
      -       e8 c5 85 13 00          callq  ffffffff81156f70 <trace_seq_printf>
      
       === here's the start of the main difference ===
      
      +       48 89 df                mov    %rbx,%rdi
      +       e8 62 7e 13 00          callq  ffffffff81156810 <trace_seq_printf>
      -       8b 93 b8 20 00 00       mov    0x20b8(%rbx),%edx
      -       31 c0                   xor    %eax,%eax
      -       85 d2                   test   %edx,%edx
      -       75 11                   jne    ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58>
      -       48 8b 83 a8 20 00 00    mov    0x20a8(%rbx),%rax
      -       48 39 83 a0 20 00 00    cmp    %rax,0x20a0(%rbx)
      -       0f 93 c0                setae  %al
      +       48 89 df                mov    %rbx,%rdi
      +       e8 4a c5 12 00          callq  ffffffff8114af00 <trace_handle_return>
              5b                      pop    %rbx
      -       0f b6 c0                movzbl %al,%eax
      
       === end ===
      
              41 5c                   pop    %r12
              5d                      pop    %rbp
              c3                      retq
      
      If you notice, the original has 22 bytes more text than the out-of-line
      version. As this applies to every TRACE_EVENT() defined in the system, the
      total can become quite large.
      
         text	   data	    bss	    dec	    hex	filename
      8690305	5450490	1298432	15439227	 eb957b	vmlinux-orig
      8681725	5450490	1298432	15430647	 eb73f7	vmlinux-handle
      
      This change has a total of 8580 bytes in savings.
      
       $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
      324
      
      That's 324 tracepoints. But this does not include modules (which contain
      many more tracepoints). For an allyesconfig build:
      
       $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l
      1401
      
      That's 1401 tracepoints, giving us:
      
         text    data     bss     dec     hex filename
      137920629       140221067       53264384        331406080       13c0db00 vmlinux-allyes-orig
      137827709       140221067       53264384        331313160       13bf7008 vmlinux-allyes-handle
      
      92920 bytes in savings!!!
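
      For reference, a minimal sketch of the out-of-line form this patch moves
      to: only a declaration stays in the header, and the single shared body
      lives in a C file (the exact file placement shown is an assumption):

       /* include/linux/trace_events.h (assumed location): declaration only */
       enum print_line_t trace_handle_return(struct trace_seq *s);

       /* kernel/trace/trace_output.c (assumed location): one shared definition */
       enum print_line_t trace_handle_return(struct trace_seq *s)
       {
               return trace_seq_has_overflowed(s) ?
                       TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED;
       }
       EXPORT_SYMBOL_GPL(trace_handle_return);

      Every TRACE_EVENT() expansion then emits a single call instruction instead
      of the inlined comparison sequence shown in the diff above.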
      
      Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.org
      Reported-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace: Have function tracing start in early boot up · dbeafd0d
      Committed by Steven Rostedt (VMware)
      Register the function tracer right after the tracing buffers are initialized
      in early boot up. This will allow function tracing to begin early if it is
      enabled via the kernel command line.
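
      For example, function tracing can then be requested from the kernel
      command line (ftrace= and ftrace_filter= are existing boot parameters;
      the filter value below is just an illustration):

       ftrace=function ftrace_filter=schedule*
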
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Postpone tracer start-up tests till the system is more robust · 9afecfbb
      Committed by Steven Rostedt (VMware)
      As tracing can now be enabled very early in boot up, even before some
      critical system services (like scheduling), do not run the tracer selftests
      until after early_initcall() is performed. If a tracer is registered before
      such time, it is saved off in a list and the test is run when the system is
      able to handle more diverse functions.
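
      A condensed sketch of that idea (the struct and function names here are
      assumptions, not necessarily the patch's own):

       /* Tracers registered too early are parked on a list... */
       struct postponed_test {
               struct list_head        list;
               struct tracer           *type;
       };

       static LIST_HEAD(postponed_selftests);

       static int save_selftest(struct tracer *type)
       {
               struct postponed_test *t = kmalloc(sizeof(*t), GFP_KERNEL);

               if (!t)
                       return -ENOMEM;

               t->type = type;
               list_add(&t->list, &postponed_selftests);
               return 0;
       }

       /* ...and an early_initcall() later drains the list, running the
        * deferred selftest for each parked tracer. */
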
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Split tracing initialization into two for early initialization · e725c731
      Committed by Steven Rostedt (VMware)
      Create an early_trace_init() function that will initialize the buffers and
      allow for earlier use of trace_printk(). This will also allow future work
      to have function tracing start earlier at boot up.
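
      A rough sketch of the shape of the split (responsibilities as described
      above; the bodies here are placeholders, not the patch's code):

       /* Called very early from start_kernel(): just enough for trace_printk() */
       void __init early_trace_init(void)
       {
               /* allocate and initialize the trace buffers */
       }

       /* Called later, once more of the system is up */
       void __init trace_init(void)
       {
               /* the rest of tracing initialization, e.g. trace events */
       }
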
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  4. 01 Mar 2017, 1 commit
  5. 17 Feb 2017, 1 commit
  6. 03 Feb 2017, 1 commit
  7. 01 Feb 2017, 1 commit
    • fs: Better permission checking for submounts · 93faccbb
      Committed by Eric W. Biederman
      To support unprivileged users mounting filesystems, two permission
      checks have to be performed: a test to see if the user is allowed to
      create a mount in the mount namespace, and a test to see if
      the user is allowed to access the specified filesystem.
      
      The automount case is special in that mounting the original filesystem
      grants permission to mount the sub-filesystems to any user who
      happens to stumble across their mountpoint and satisfies the
      ordinary filesystem permission checks.
      
      Attempting to handle the automount case by using override_creds
      almost works.  It preserves the idea that permission to mount
      the original filesystem is permission to mount the sub-filesystem.
      Unfortunately, using override_creds messes up the filesystem's
      ordinary permission checks.
      
      Solve this by being explicit that a mount is a submount by introducing
      vfs_submount, and using it where appropriate.
      
      vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let
      sget and friends know that a mount is a submount so they can take
      appropriate action.
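
      A plausible minimal form of such a helper (a sketch based on the
      description above; the real body may differ):

       struct vfsmount *vfs_submount(const struct dentry *mountpoint,
                                     struct file_system_type *type,
                                     const char *name, void *data)
       {
               /*
                * MS_SUBMOUNT lets sget and friends skip the ordinary
                * mount permission checks for this mount. The mountpoint
                * is passed down so future checks can make use of it.
                */
               return vfs_kern_mount(type, MS_SUBMOUNT, name, data);
       }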
      
      sget and sget_userns are modified to not perform any permission checks
      on submounts.
      
      follow_automount is modified to stop using override_creds as that
      has proven problematic.
      
      do_mount is modified to always remove the new MS_SUBMOUNT flag so
      that we know userspace will never be able to specify it.
      
      autofs4 is modified to stop using current_real_cred(), which was put in
      there to handle the previous version of submount permission checking.
      
      cifs is modified to pass the mountpoint all of the way down to vfs_submount.
      
      debugfs is modified to pass the mountpoint all of the way down to
      trace_automount by adding a new parameter.  To make this change easier
      a new typedef debugfs_automount_t is introduced to capture the type of
      the debugfs automount function.
      
      Cc: stable@vger.kernel.org
      Fixes: 069d5ac9 ("autofs: Fix automounts by using current_real_cred()->uid")
      Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
      Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  8. 25 Dec 2016, 1 commit
  9. 13 Dec 2016, 1 commit
  10. 09 Dec 2016, 1 commit
    • tracing: Replace kmap with copy_from_user() in trace_marker writing · 656c7f0d
      Committed by Steven Rostedt (Red Hat)
      Instead of using get_user_pages_fast() and kmap_atomic() when writing
      to the trace_marker file, just allocate enough space on the ring buffer
      directly, and write into it via copy_from_user().
      
      Writing into the trace_marker file used to allocate a temporary buffer
      to perform the copy_from_user(), as we didn't want to write into the
      ring buffer if the copy failed. But as a trace_marker write is supposed
      to be extremely fast, and allocating memory causes other tracepoints to
      trigger, Peter Zijlstra suggested using get_user_pages_fast() and
      kmap_atomic() to keep the user space pages in memory and read them
      directly. But Henrik Austad had issues with this because it required
      taking the mm->mmap_sem, causing long delays with the write.
      
      Instead, just allocate the space in the ring buffer and use
      copy_from_user() directly. If it faults, return -EFAULT and write
      "<faulted>" into the ring buffer.
      
      Link: http://lkml.kernel.org/r/20161208124018.72dd0f86@gandalf.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Henrik Austad <henrik@austad.us>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Updates: d696b58c "tracing: Do not allocate buffer for trace_marker"
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  11. 02 Dec 2016, 1 commit
  12. 30 Nov 2016, 1 commit
  13. 24 Nov 2016, 3 commits
  14. 23 Nov 2016, 1 commit
  15. 16 Nov 2016, 1 commit
  16. 15 Nov 2016, 1 commit
  17. 26 Sep 2016, 1 commit
  18. 25 Sep 2016, 1 commit
  19. 12 Sep 2016, 1 commit
  20. 03 Sep 2016, 1 commit
    • tracing: Added hardware latency tracer · e7c15cd8
      Committed by Steven Rostedt (Red Hat)
      The hardware latency tracer has been in the PREEMPT_RT patch for some time.
      It is used to detect possible SMIs or any other hardware interruptions that
      the kernel is unaware of. Note that NMIs may also be detected, which is
      useful information in itself.
      
      The logic is pretty simple. It simply creates a thread that spins on a
      single CPU for a specified amount of time (width) within a periodic window
      (window). These numbers may be adjusted via their corresponding files in
      
         /sys/kernel/tracing/hwlat_detector/
      
      The defaults are window = 1000000 us (1 second)
                       width  =  500000 us (1/2 second)
      
      The loop consists of:
      
      	t1 = trace_clock_local();
      	t2 = trace_clock_local();
      
      Where trace_clock_local() is a variant of sched_clock().
      
      The difference t2 - t1 is recorded as the "inner" latency, and the
      difference t1 - prev_t2 is recorded as the "outer" latency. If either of
      these differences is greater than the time set in
      /sys/kernel/tracing/tracing_thresh, the event is recorded.
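
      The heart of the detector is therefore roughly the following (a condensed
      sketch; the helper and variable names are assumptions):

       u64 t1, t2, prev_t2 = 0;
       unsigned long end_of_width = jiffies + usecs_to_jiffies(width);

       while (time_before(jiffies, end_of_width)) {
               t1 = trace_clock_local();
               t2 = trace_clock_local();
               if (prev_t2) {
                       u64 outer = t1 - prev_t2;  /* gap between sample pairs */
                       u64 inner = t2 - t1;       /* gap inside one sample pair */

                       if (inner > thresh || outer > thresh)
                               record_hwlat_sample(inner, outer); /* hypothetical */
               }
               prev_t2 = t2;
       }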
      
      When this tracer is started, and tracing_thresh is zero, it changes to the
      default threshold of 10 us.
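
      Usage then follows the usual tracefs pattern, e.g. (a sketch):

       # echo hwlat > /sys/kernel/tracing/current_tracer
       # echo 100 > /sys/kernel/tracing/tracing_thresh
       # cat /sys/kernel/tracing/trace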
      
      The hwlat tracer in the PREEMPT_RT patch was originally written by
      Jon Masters. I have modified it quite a bit and turned it into a
      tracer.
      Based-on-code-by: Jon Masters <jcm@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  21. 24 Aug 2016, 1 commit
  22. 05 Jul 2016, 2 commits
  23. 24 Jun 2016, 1 commit
    • tracing: Skip more functions when doing stack tracing of events · be54f69c
      Committed by Steven Rostedt (Red Hat)
       # echo 1 > options/stacktrace
       # echo 1 > events/sched/sched_switch/enable
       # cat trace
                <idle>-0     [002] d..2  1982.525169: <stack trace>
       => save_stack_trace
       => __ftrace_trace_stack
       => trace_buffer_unlock_commit_regs
       => event_trigger_unlock_commit
       => trace_event_buffer_commit
       => trace_event_raw_event_sched_switch
       => __schedule
       => schedule
       => schedule_preempt_disabled
       => cpu_startup_entry
       => start_secondary
      
      The above shows that we are seeing 6 functions before ever making it to the
      caller of the sched_switch event.
      
       # echo stacktrace > events/sched/sched_switch/trigger
       # cat trace
                <idle>-0     [002] d..3  2146.335208: <stack trace>
       => trace_event_buffer_commit
       => trace_event_raw_event_sched_switch
       => __schedule
       => schedule
       => schedule_preempt_disabled
       => cpu_startup_entry
       => start_secondary
      
      The stacktrace trigger isn't as bad, because it adds its own skip to the
      stacktracing, but it still shows two extra functions.
      
      One issue is that if the stacktrace passes its own "regs" then there should
      be no addition to the skip, as the regs will not include the functions being
      called. This was fixed by commit 7717c6be ("tracing: Fix stacktrace
      skip depth in trace_buffer_unlock_commit_regs()"), as adding the skip
      number for kprobes made the probes not have any stack at all.
      
      But since this is only an issue when regs is being used, a skip should be
      added if regs is NULL. Now we have:
      
       # echo 1 > options/stacktrace
       # echo 1 > events/sched/sched_switch/enable
       # cat trace
                <idle>-0     [000] d..2  1297.676333: <stack trace>
       => __schedule
       => schedule
       => schedule_preempt_disabled
       => cpu_startup_entry
       => rest_init
       => start_kernel
       => x86_64_start_reservations
       => x86_64_start_kernel
      
       # echo stacktrace > events/sched/sched_switch/trigger
       # cat trace
                <idle>-0     [002] d..3  1370.759745: <stack trace>
       => __schedule
       => schedule
       => schedule_preempt_disabled
       => cpu_startup_entry
       => start_secondary
      
      And kprobes are not touched.
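
      The shape of the fix is roughly the following (a sketch; the constant 4
      stands in for the four commit-path helpers listed in the first trace
      above):

       if (!regs)
               skip += 4;      /* skip the commit-path helpers; regs-based
                                * callers such as kprobes are left alone */
       __ftrace_trace_stack(buffer, flags, skip, pc, regs);
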
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  24. 20 Jun 2016, 5 commits
  25. 04 May 2016, 1 commit
    • tracing: Use temp buffer when filtering events · 0fc1b09f
      Committed by Steven Rostedt (Red Hat)
      Filtering of events requires the data to be written to the ring buffer
      before it can be decided whether to filter the event or not. This is
      because the filter's parameters are based on the result that is written
      to the ring buffer and not on the parameters that are passed into the
      trace functions.
      
      The ftrace ring buffer is optimized for writing into the ring buffer and
      committing. The discard procedure, used when the filter decides the event
      should be discarded, is much more heavyweight. Thus, using a temporary
      buffer when filtering events can speed things up drastically.
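
      In outline, the idea is the following (a sketch; the helper names other
      than filter_match_preds() and ring_buffer_lock_reserve() are assumptions):

       if (filtering_enabled) {
               entry = reserve_in_temp_buffer(size);      /* per-CPU scratch */
               write_event_fields(entry);
               if (!filter_match_preds(file->filter, entry))
                       return;  /* discard is free: ring buffer untouched */
               copy_into_ring_buffer(buffer, entry, size); /* the extra copy */
       } else {
               entry = ring_buffer_lock_reserve(buffer, size);
               write_event_fields(entry);
               commit(buffer, entry);
       }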
      
      Without a temp buffer we have:
      
       # trace-cmd start -p nop
       # perf stat -r 10 hackbench 50
             0.790706626 seconds time elapsed ( +-  0.71% )
      
       # trace-cmd start -e all
       # perf stat -r 10 hackbench 50
             1.566904059 seconds time elapsed ( +-  0.27% )
      
       # trace-cmd start -e all -f 'common_preempt_count==20'
       # perf stat -r 10 hackbench 50
             1.690598511 seconds time elapsed ( +-  0.19% )
      
       # trace-cmd start -e all -f 'common_preempt_count!=20'
       # perf stat -r 10 hackbench 50
             1.707486364 seconds time elapsed ( +-  0.30% )
      
      The first run above is without any tracing, just to get a baseline figure.
      hackbench takes ~0.79 seconds to run on the system.
      
      The second run enables tracing all events where nothing is filtered. This
      increases the time by 100% and hackbench takes 1.57 seconds to run.
      
      The third run filters all events where the preempt count equals "20"
      (which should never happen), so all events are discarded. This takes 1.69
      seconds to run, which is 10% slower than just committing the events!
      
      The last run enables all events with a filter that will commit every
      event, and this takes 1.70 seconds to run. The filtering overhead is
      approximately 10%. Thus, discarding an event and committing an event from
      the ring buffer may take about the same time.
      
      With this patch, the numbers change:
      
       # trace-cmd start -p nop
       # perf stat -r 10 hackbench 50
             0.778233033 seconds time elapsed ( +-  0.38% )
      
       # trace-cmd start -e all
       # perf stat -r 10 hackbench 50
             1.582102692 seconds time elapsed ( +-  0.28% )
      
       # trace-cmd start -e all -f 'common_preempt_count==20'
       # perf stat -r 10 hackbench 50
             1.309230710 seconds time elapsed ( +-  0.22% )
      
       # trace-cmd start -e all -f 'common_preempt_count!=20'
       # perf stat -r 10 hackbench 50
             1.786001924 seconds time elapsed ( +-  0.20% )
      
      The first run is again the base with no tracing.
      
      The second run is all tracing with no filtering. It is a little slower, but
      that may be well within the noise.
      
      The third run shows that discarding all events only took 1.3 seconds. This
      is a speed up of 23%! The discard is much faster than even the commit.
      
      The one downside is shown in the last run. Events that are not discarded
      by the filter take longer to add; this is due to the extra copy of the
      event.
      
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  26. 30 Apr 2016, 5 commits