1. 26 1月, 2009 2 次提交
    • A
      blktrace: add ftrace plugin · c71a8961
      Arnaldo Carvalho de Melo 提交于
      Impact: New way of using the blktrace infrastructure
      
      This drops the requirement of userspace utilities to use the blktrace
      facility.
      
      Configuration is done thru sysfs, adding a "trace" directory to the
      partition directory where blktrace can be enabled for the associated
      request_queue.
      
      The same filters present in the IOCTL interface are present as sysfs
      device attributes.
      
      The /sys/block/sdX/sdXN/trace/enable file allows tracing without any
      filters.
      
      The other files in this directory: pid, act_mask, start_lba and end_lba
      can be used with the same meaning as with the IOCTL interface.
      
      Using the sysfs interface will only setup the request_queue->blk_trace
      fields, tracing will only take place when the "blk" tracer is selected
      via the ftrace interface, as in the following example:
      
      To see the trace, one can use the /d/tracing/trace file or the
      /d/tracign/trace_pipe file, with semantics defined in the ftrace
      documentation in Documentation/ftrace.txt.
      
      [root@f10-1 ~]# cat /t/trace
             kjournald-305   [000]  3046.491224:   8,1    A WBS 6367 + 8 <- (8,1) 6304
             kjournald-305   [000]  3046.491227:   8,1    Q   R 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491236:   8,1    G  RB 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491239:   8,1    P  NS [kjournald]
             kjournald-305   [000]  3046.491242:   8,1    I RBS 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491251:   8,1    D  WB 6367 + 8 [kjournald]
             kjournald-305   [000]  3046.491610:   8,1    U  WS [kjournald] 1
                <idle>-0     [000]  3046.511914:   8,1    C  RS 6367 + 8 [6367]
      [root@f10-1 ~]#
      
      The default line context (prefix) format is the one described in the ftrace
      documentation, with the blktrace specific bits using its existing format,
      described in blkparse(8).
      
      If one wants to have the classic blktrace formatting, this is possible by
      using:
      
      [root@f10-1 ~]# echo blk_classic > /t/trace_options
      [root@f10-1 ~]# cat /t/trace
        8,1    0  3046.491224   305  A WBS 6367 + 8 <- (8,1) 6304
        8,1    0  3046.491227   305  Q   R 6367 + 8 [kjournald]
        8,1    0  3046.491236   305  G  RB 6367 + 8 [kjournald]
        8,1    0  3046.491239   305  P  NS [kjournald]
        8,1    0  3046.491242   305  I RBS 6367 + 8 [kjournald]
        8,1    0  3046.491251   305  D  WB 6367 + 8 [kjournald]
        8,1    0  3046.491610   305  U  WS [kjournald] 1
        8,1    0  3046.511914     0  C  RS 6367 + 8 [6367]
      [root@f10-1 ~]#
      
      Using the ftrace standard format allows more flexibility, such
      as the ability of asking for backtraces via trace_options:
      
      [root@f10-1 ~]# echo noblk_classic > /t/trace_options
      [root@f10-1 ~]# echo stacktrace > /t/trace_options
      
      [root@f10-1 ~]# cat /t/trace
             kjournald-305   [000]  3318.826779:   8,1    A WBS 6375 + 8 <- (8,1) 6312
             kjournald-305   [000]  3318.826782:
       <= submit_bio
       <= submit_bh
       <= sync_dirty_buffer
       <= journal_commit_transaction
       <= kjournald
       <= kthread
       <= child_rip
             kjournald-305   [000]  3318.826836:   8,1    Q   R 6375 + 8 [kjournald]
             kjournald-305   [000]  3318.826837:
       <= generic_make_request
       <= submit_bio
       <= submit_bh
       <= sync_dirty_buffer
       <= journal_commit_transaction
       <= kjournald
       <= kthread
      
      Please read the ftrace documentation to use aditional, standardized
      tracing filters such as /d/tracing/trace_cpumask, etc.
      
      See also /d/tracing/trace_mark to add comments in the trace stream,
      that is equivalent to the /d/block/sdaN/msg interface.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c71a8961
    • A
      ftrace: add ftrace_vprintk · 9011262a
      Arnaldo Carvalho de Melo 提交于
      Impact: new helper function
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9011262a
  2. 23 1月, 2009 4 次提交
    • F
      tracing/function-graph-tracer: various fixes and features · 9005f3eb
      Frederic Weisbecker 提交于
      This patch brings various bugfixes:
      
      - Drop the first irrelevant task switch on the very beginning of a trace.
      
      - Drop the OVERHEAD word from the headers, the DURATION word is sufficient
        and will not overlap other columns.
      
      - Make the headers fit well their respective columns whatever the
        selected options.
      
      Ie, default options:
      
       # tracer: function_graph
       #
       # CPU  DURATION                  FUNCTION CALLS
       # |     |   |                     |   |   |   |
      
        1)   0.646 us    |                    }
        1)               |                    mem_cgroup_del_lru_list() {
        1)   0.624 us    |                      lookup_page_cgroup();
        1)   1.970 us    |                    }
      
       echo funcgraph-proc > trace_options
      
       # tracer: function_graph
       #
       # CPU  TASK/PID        DURATION                  FUNCTION CALLS
       # |    |    |           |   |                     |   |   |   |
      
        0)   bash-2937    |   0.895 us    |                }
        0)   bash-2937    |   0.888 us    |                __rcu_read_unlock();
        0)   bash-2937    |   0.864 us    |                conv_uni_to_pc();
        0)   bash-2937    |   1.015 us    |                __rcu_read_lock();
      
       echo nofuncgraph-cpu > trace_options
       echo nofuncgraph-proc > trace_options
      
       # tracer: function_graph
       #
       #   DURATION                  FUNCTION CALLS
       #    |   |                     |   |   |   |
      
         3.752 us    |                  native_pud_val();
         0.616 us    |                  native_pud_val();
         0.624 us    |                  native_pmd_val();
      
      About features, one can now disable the duration (this will hide the
      overhead too for convenient reasons and because on  doesn't need
      overhead if it hasn't the duration):
      
       echo nofuncgraph-duration > trace_options
      
       # tracer: function_graph
       #
       #                FUNCTION CALLS
       #                |   |   |   |
      
                 cap_vm_enough_memory() {
                   __vm_enough_memory() {
                     vm_acct_memory();
                   }
                 }
               }
      
      And at last, an option to print the absolute time:
      
       //Restart from default options
       echo funcgraph-abstime > trace_options
      
       # tracer: function_graph
       #
       #      TIME       CPU  DURATION                  FUNCTION CALLS
       #       |         |     |   |                     |   |   |   |
      
         261.339774 |   1) + 42.823 us   |    }
         261.339775 |   1)   1.045 us    |    _spin_lock_irq();
         261.339777 |   1)   0.940 us    |    _spin_lock_irqsave();
         261.339778 |   1)   0.752 us    |    _spin_unlock_irqrestore();
         261.339780 |   1)   0.857 us    |    _spin_unlock_irq();
         261.339782 |   1)               |    flush_to_ldisc() {
         261.339783 |   1)               |      tty_ldisc_ref() {
         261.339783 |   1)               |        tty_ldisc_try() {
         261.339784 |   1)   1.075 us    |          _spin_lock_irqsave();
         261.339786 |   1)   0.842 us    |          _spin_unlock_irqrestore();
         261.339788 |   1)   4.211 us    |        }
         261.339788 |   1)   5.662 us    |      }
      
      The format is seconds.usecs.
      
      I guess no one needs the nanosec precision here, the main goal is to have
      an overview about the general timings of events, and to see the place when
      the trace switches from one cpu to another.
      
      ie:
      
         274.874760 |   1)   0.676 us    |      _spin_unlock();
         274.874762 |   1)   0.609 us    |      native_load_sp0();
         274.874763 |   1)   0.602 us    |      native_load_tls();
         274.878739 |   0)   0.722 us    |                  }
         274.878740 |   0)   0.714 us    |                  native_pmd_val();
         274.878741 |   0)   0.730 us    |                  native_pmd_val();
      
      Here there is a 4000 usecs difference when we switch the cpu.
      
      Changes in V2:
      
      - Completely fix the first pointless task switch.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9005f3eb
    • S
      trace, lockdep: manual preempt count adding for local_bh_disable · 7e49fcce
      Steven Rostedt 提交于
      Impact: fix to preempt trace triggering lockdep check_flag failure
      
      In local_bh_disable, the use of add_preempt_count causes the
      preempt tracer to start recording the time preemption is off.
      But because it already modified the preempt_count to show
      softirqs disabled, and before it called the lockdep code to
      handle this, it causes a state that lockdep can not handle.
      
      The preempt tracer will reset the ring buffer on start of a trace,
      and the ring buffer reset code does a spin_lock_irqsave. This
      calls into lockdep and lockdep will fail when it detects the
      invalid state of having softirqs disabled but the internal
      current->softirqs_enabled is still set.
      
      The fix is to manually add the SOFTIRQ_OFFSET to preempt count
      and call the preempt tracer code outside the lockdep critical
      area.
      
      Thanks to Peter Zijlstra for suggesting this solution.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7e49fcce
    • S
      trace: fix logic to start/stop counting · b06a8301
      Steven Rostedt 提交于
      The logic in the tracing_start/stop code prevents the WARN_ON
      from ever detecting if a start/stop pair was mismatched.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b06a8301
    • S
      trace: remove internal irqsoff disabling for trace output · 94523e81
      Steven Rostedt 提交于
      Impact: cleanup of duplicate features
      
      The trace output disables the ring buffer and prevents tracing to
      occur. The code in irqsoff to do the same thing is no longer needed.
      This patch removes it.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      94523e81
  3. 22 1月, 2009 5 次提交
  4. 20 1月, 2009 5 次提交
  5. 17 1月, 2009 2 次提交
  6. 16 1月, 2009 12 次提交
  7. 15 1月, 2009 10 次提交
    • P
      sched: fix update_min_vruntime · e17036da
      Peter Zijlstra 提交于
      Impact: fix SCHED_IDLE latency problems
      
      OK, so we have 1 running task A (which is obviously curr and the tree is
      equally obviously empty).
      
      'A' nicely chugs along, doing its thing, carrying min_vruntime along as it
      goes.
      
      Then some whacko speed freak SCHED_IDLE task gets inserted due to SMP
      balancing, which is very likely far right, in that case
      
      update_curr
        update_min_vruntime
          cfs_rq->rb_leftmost := true (the crazy task sitting in a tree)
            vruntime = se->vruntime
      
      and voila, min_vruntime is waaay right of where it ought to be.
      
      OK, so why did I write it like that to begin with...
      
      Aah, yes.
      
      Say we've just dequeued current
      
      schedule
        deactivate_task(prev)
          dequeue_entity
            update_min_vruntime
      
      Then we'll set
      
        vruntime = cfs_rq->min_vruntime;
      
      we find !cfs_rq->curr, but do find someone in the tree. Then we _must_
      do vruntime = se->vruntime, because
      
       vruntime = min_vruntime(vruntime := cfs_rq->min_vruntime, se->vruntime)
      
      will not advance vruntime, and cause lags the other way around (which we
      fixed with that initial patch: 1af5f730
      (sched: more accurate min_vruntime accounting).
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NMike Galbraith <efault@gmx.de>
      Cc: <stable@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e17036da
    • P
      sched: SCHED_OTHER vs SCHED_IDLE isolation · 6bc912b7
      Peter Zijlstra 提交于
      Stronger SCHED_IDLE isolation:
      
       - no SCHED_IDLE buddies
       - never let SCHED_IDLE preempt on wakeup
       - always preempt SCHED_IDLE on wakeup
       - limit SLEEPER fairness for SCHED_IDLE.
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6bc912b7
    • P
      sched: SCHED_IDLE weight change · cce7ade8
      Peter Zijlstra 提交于
      Increase the SCHED_IDLE weight from 2 to 3, this gives much more stable
      vruntime numbers.
      
      time advanced in 100ms:
      
       weight=2
      
       64765.988352
       67012.881408
       88501.412352
      
       weight=3
      
       35496.181411
       34130.971298
       35497.411573
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cce7ade8
    • P
      sched: fix bandwidth validation for UID grouping · 98a4826b
      Peter Zijlstra 提交于
      Impact: make rt-limit tunables work again
      
      Mark Glines reported:
      
      > I've got an issue on x86-64 where I can't configure the system to allow
      > RT tasks for a non-root user.
      >
      > In 2.6.26.5, I was able to do the following to set things up nicely:
      > echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
      > echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
      >
      > Seems like every value I try to echo into the /sys files returns EINVAL.
      
      For UID grouping we initialize the root group with infinite bandwidth
      which by default is actually more than the global limit, therefore the
      bandwidth check always fails.
      
      Because the root group is a phantom group (for UID grouping) we cannot
      runtime adjust it, therefore we let it reflect the global bandwidth
      settings.
      Reported-by: NMark Glines <mark@glines.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      98a4826b
    • I
      tracing: trace_stat.c cleanup · 55922173
      Ingo Molnar 提交于
      Impact: cleanup
      
      - whitespace / code alignment cleanups
      - avoid unnecessary forward prototype by reordering functions
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      55922173
    • L
      tracing/ftrace: add missing unlock in register_stat_tracer() · 42fab4b2
      Li Zefan 提交于
      We should unlock all_stat_sessions_mutex before returning failure.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      42fab4b2
    • F
      tracing/function-graph-tracer: fix a regression while suspend to disk · 4a2b8dda
      Frederic Weisbecker 提交于
      Impact: fix a crash while kernel image restore
      
      When the function graph tracer is running and while suspend to disk, some racy
      and dangerous things happen against this tracer.
      
      The current task will save its registers including the stack pointer which
      contains the return address hooked by the tracer. But the current task will
      continue to enter other functions after that to save the memory, and then
      it will store other return addresses, and finally loose the old depth which
      matches the return address saved in the old stack (during the registers saving).
      
      So on image restore, the code will return to wrong addresses.
      And there are other things: on restore, the task will have it's "current"
      pointer overwritten during registers restoring....switching from one task to
      another... That would be insane to try to trace function graphs at these
      stages.
      
      This patch makes the function graph tracer listening on power events, making
      it's tracing disabled for the current task (the one that performs the
      hibernation work) while suspend/resume to disk, making the tracing safe
      during hibernation.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4a2b8dda
    • S
      trace: stop all recording to ring buffer on ftrace_dump · 0ee6b6cf
      Steven Rostedt 提交于
      Impact: limit ftrace dump output
      
      Currently ftrace_dump only calls ftrace_kill that is a fast way
      to prevent the function tracer functions from being called (just sets
      a flag and clears the function to call, nothing else). It is better
      to also turn off any recording to the ring buffers as well.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0ee6b6cf
    • L
      ring_buffer: reset write when reserve buffer fail · 6f3b3440
      Lai Jiangshan 提交于
      Impact: reset struct buffer_page.write when interrupt storm
      
      if struct buffer_page.write is not reset, any succedent committing
      will corrupted ring_buffer:
      
      static inline void
      rb_set_commit_to_write(struct ring_buffer_per_cpu *cpu_buffer)
      {
      	......
      		cpu_buffer->commit_page->commit =
      			cpu_buffer->commit_page->write;
      	......
      }
      
      when "if (RB_WARN_ON(cpu_buffer, next_page == reader_page))", ring_buffer
      is disabled, but some reserved buffers may haven't been committed.
      we need reset struct buffer_page.write.
      
      when "if (unlikely(next_page == cpu_buffer->commit_page))", ring_buffer
      is still available, we should not corrupt it.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6f3b3440
    • S
      trace: print ftrace_dump at KERN_EMERG log level · 428aee14
      Steven Rostedt 提交于
      Impact: fix to print out ftrace_dump when expected
      
      I was debugging a hard race condition to only find out that
      after I hit the race, my log level was not at level to show
      KERN_INFO. The time it took to trigger the race was wasted because
      I did not capture the trace.
      
      Since ftrace_dump is only called from kernel oops (and only when
      it is set in the kernel command line to do so), or when a
      developer adds it to their own local tree, the log level of
      the print should be at KERN_EMERG to make sure the print appears.
      
      ftrace_dump is not called by a normal user setup, and will not
      add extra unwanted print out to the console. There is no reason
      it should be at KERN_INFO.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      428aee14