1. 24 11月, 2016 2 次提交
  2. 14 5月, 2016 1 次提交
    • S
      ring-buffer: Prevent overflow of size in ring_buffer_resize() · 59643d15
      Steven Rostedt (Red Hat) 提交于
      If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE
      then the DIV_ROUND_UP() will return zero.
      
      Here's the details:
      
        # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb
      
      tracing_entries_write() processes this and converts kb to bytes.
      
       18014398509481980 << 10 = 18446744073709547520
      
      and this is passed to ring_buffer_resize() as unsigned long size.
      
       size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
      
      Where DIV_ROUND_UP(a, b) is (a + b - 1)/b
      
      BUF_PAGE_SIZE is 4080 and here
      
       18446744073709547520 + 4080 - 1 = 18446744073709551599
      
      where 18446744073709551599 is still smaller than 2^64
      
       2^64 - 18446744073709551599 = 17
      
      But now 18446744073709551599 / 4080 = 4521260802379792
      
      and size = size * 4080 = 18446744073709551360
      
      This is checked to make sure its still greater than 2 * 4080,
      which it is.
      
      Then we convert to the number of buffer pages needed.
      
       nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE)
      
      but this time size is 18446744073709551360 and
      
       2^64 - (18446744073709551360 + 4080 - 1) = -3823
      
      Thus it overflows and the resulting number is less than 4080, which makes
      
        3823 / 4080 = 0
      
      an nr_pages is set to this. As we already checked against the minimum that
      nr_pages may be, this causes the logic to fail as well, and we crash the
      kernel.
      
      There's no reason to have the two DIV_ROUND_UP() (that's just result of
      historical code changes), clean up the code and fix this bug.
      
      Cc: stable@vger.kernel.org # 3.5+
      Fixes: 83f40318 ("ring-buffer: Make removal of ring buffer pages atomic")
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      59643d15
  3. 13 5月, 2016 1 次提交
    • S
      ring-buffer: Use long for nr_pages to avoid overflow failures · 9b94a8fb
      Steven Rostedt (Red Hat) 提交于
      The size variable to change the ring buffer in ftrace is a long. The
      nr_pages used to update the ring buffer based on the size is int. On 64 bit
      machines this can cause an overflow problem.
      
      For example, the following will cause the ring buffer to crash:
      
       # cd /sys/kernel/debug/tracing
       # echo 10 > buffer_size_kb
       # echo 8556384240 > buffer_size_kb
      
      Then you get the warning of:
      
       WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260
      
      Which is:
      
        RB_WARN_ON(cpu_buffer, nr_removed);
      
      Note each ring buffer page holds 4080 bytes.
      
      This is because:
      
       1) 10 causes the ring buffer to have 3 pages.
          (10kb requires 3 * 4080 pages to hold)
      
       2) (2^31 / 2^10  + 1) * 4080 = 8556384240
          The value written into buffer_size_kb is shifted by 10 and then passed
          to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760
      
       3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
          which is 4080. 8761737461760 / 4080 = 2147484672
      
       4) nr_pages is subtracted from the current nr_pages (3) and we get:
          2147484669. This value is saved in a signed integer nr_pages_to_update
      
       5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int
          turns into the value of -2147482627
      
       6) As the value is a negative number, in update_pages_handler() it is
          negated and passed to rb_remove_pages() and 2147482627 pages will
          be removed, which is much larger than 3 and it causes the warning
          because not all the pages asked to be removed were removed.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001
      
      Cc: stable@vger.kernel.org # 2.6.28+
      Fixes: 7a8e76a3 ("tracing: unified trace buffer")
      Reported-by: NHao Qin <QEver.cn@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      9b94a8fb
  4. 26 11月, 2015 1 次提交
  5. 24 11月, 2015 4 次提交
    • S
      ring-buffer: Remove redundant update of page timestamp · 70004986
      Steven Rostedt (Red Hat) 提交于
      The first commit of a buffer page updates the timestamp of that page. No
      need to have the update to the next page add the timestamp too. It will only
      be replaced by the first commit on that page anyway.
      
      Only update to a page if it contains an event.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      70004986
    • S
      ring-buffer: Use READ_ONCE() for most tail_page access · 8573636e
      Steven Rostedt (Red Hat) 提交于
      As cpu_buffer->tail_page may be modified by interrupts at almost any time,
      the flow of logic is very important. Do not let gcc get smart with
      re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
      accesses.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8573636e
    • S
      ring-buffer: Put back the length if crossed page with add_timestamp · bd1b7cd3
      Steven Rostedt (Red Hat) 提交于
      Commit fcc742ea "ring-buffer: Add event descriptor to simplify passing
      data" added a descriptor that holds various data instead of passing around
      several variables through parameters. The problem was that one of the
      parameters was modified in a function and the code was designed not to have
      an effect on that modified  parameter. Now that the parameter is a
      descriptor and any modifications to it are non-volatile, the size of the
      data could be unnecessarily expanded.
      
      Remove the extra space added if a timestamp was added and the event went
      across the page.
      
      Cc: stable@vger.kernel.org # 4.3+
      Fixes: fcc742ea "ring-buffer: Add event descriptor to simplify passing data"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      bd1b7cd3
    • S
      ring-buffer: Update read stamp with first real commit on page · b81f472a
      Steven Rostedt (Red Hat) 提交于
      Do not update the read stamp after swapping out the reader page from the
      write buffer. If the reader page is swapped out of the buffer before an
      event is written to it, then the read_stamp may get an out of date
      timestamp, as the page timestamp is updated on the first commit to that
      page.
      
      rb_get_reader_page() only returns a page if it has an event on it, otherwise
      it will return NULL. At that point, check if the page being returned has
      events and has not been read yet. Then at that point update the read_stamp
      to match the time stamp of the reader page.
      
      Cc: stable@vger.kernel.org # 2.6.30+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b81f472a
  6. 03 11月, 2015 4 次提交
  7. 03 9月, 2015 1 次提交
    • S
      ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated" · b7dc42fd
      Steven Rostedt (Red Hat) 提交于
      The commit a4543a2f "ring-buffer: Get timestamp after event is
      allocated" is needed for some future work. But after adding it, there is a
      race somewhere that causes the saved timestamp to have a slight shift, and
      get ahead of the actual timestamp and make it look like time goes backwards.
      
      I'm still looking into why this happens, but in the mean time, this is
      holding up other work to get in. I'm reverting the change for now (which
      makes the problem go away), and will add it back after I know what is wrong
      and fix it.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b7dc42fd
  8. 21 7月, 2015 5 次提交
    • S
      ring-buffer: Reorganize function locations · d90fd774
      Steven Rostedt (Red Hat) 提交于
      Functions in ring-buffer.c have gotten interleaved between different
      use cases. Move the functions around to get like functions closer
      together. This may or may not help gcc keep cache locality, but it
      makes it a little easier to work with the code.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      d90fd774
    • S
      ring-buffer: Make sure event has enough room for extend and padding · 7d75e683
      Steven Rostedt (Red Hat) 提交于
      Now that events only add time extends after it is committed, in case
      an event comes in before it can discard the allocated event, the time
      extend needs to be stored within the event. If the event is bigger
      than then size needed for the time extend, padding must be added.
      The minimum padding size is 8 bytes. Thus if the event is 12 bytes
      (size of time extend + 4), there will not be enough room to add both
      the time extend and padding. Make sure all events are either 8 bytes
      or 16 or more bytes.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      7d75e683
    • S
      ring-buffer: Get timestamp after event is allocated · a4543a2f
      Steven Rostedt (Red Hat) 提交于
      Move the capturing of the timestamp to after an event is allocated.
      If the event is not a commit (where it is an event that preempted
      another event), then no timestamp is needed, because the delta of
      nested events is always zero.
      
      If the event starts on a new page, no delta needs to be calculated
      as the full timestamp will be added to the page header, and the
      event will have a delta of zero.
      
      Now if the event requires a time extend (the delta does not fit
      in the 27 bit delta slot in the header), then the event is discarded,
      the length is extended to hold the TIME_EXTEND event that allows for
      a 59 bit delta, and the commit is tried again.
      
      If the event can't be discarded (another event came in after it),
      then the TIME_EXTEND is added directly to the allocated event and
      the rest of the event is given padding.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a4543a2f
    • S
      ring-buffer: Move the adding of the extended timestamp out of line · 9826b273
      Steven Rostedt (Red Hat) 提交于
      Requiring a extended time stamp is an uncommon occurrence, and it is
      best to do it out of line when needed.
      
      Add a noinline function that handles the extended timestamp and
      have it called with an unlikely to completely move it out of the
      fast path.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      9826b273
    • S
      ring-buffer: Add event descriptor to simplify passing data · fcc742ea
      Steven Rostedt (Red Hat) 提交于
      Add rb_event_info descriptor to pass event info to functions a bit
      easier than using a bunch of parameters. This will also allow for
      changing the code around a bit to find better fast paths.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      fcc742ea
  9. 29 5月, 2015 3 次提交
    • S
      ring-buffer: Add enum names for the context levels · a497adb4
      Steven Rostedt (Red Hat) 提交于
      Instead of having hard coded numbers for the context levels, use
      enums to describe them more.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a497adb4
    • S
      ring-buffer: Remove useless unused tracing_off_permanent() · 3c6296f7
      Steven Rostedt (Red Hat) 提交于
      The tracing_off_permanent() call is a way to disable all ring_buffers.
      Nothing uses it and nothing should use it, as tracing_off() and
      friends are better, as they disable the ring buffers related to
      tracing. The tracing_off_permanent() even disabled non tracing
      ring buffers. This is a bit drastic, and was added to handle NMIs
      doing outputs that could corrupt the ring buffer when only tracing
      used them. It is now obsolete and adds a little overhead, it should
      be removed.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3c6296f7
    • S
      ring-buffer: Give NMIs a chance to lock the reader_lock · 289a5a25
      Steven Rostedt (Red Hat) 提交于
      Currently, if an NMI does a dump of a ring buffer, it disables
      all ring buffers from ever doing any writes again. This is because
      it wont take the locks for the cpu_buffer and this can cause
      corruption if it preempted a read, or a read happens on another
      CPU for the current cpu buffer. This is a bit overkill.
      
      First, it should at least try to take the lock, and if it fails
      then disable it. Also, there's no need to disable all ring
      buffers, even those that are unrelated to what is being read.
      Only disable the per cpu ring buffer that is being read if
      it can not get the lock for it.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      289a5a25
  10. 27 5月, 2015 3 次提交
    • S
      ring-buffer: Add trace_recursive checks to ring_buffer_write() · 985e871b
      Steven Rostedt (Red Hat) 提交于
      The ring_buffer_write() function isn't protected by the trace recursive
      writes. Luckily, this function is not used as much and is unlikely
      to ever recurse. But it should still have the protection, because
      even a call to ring_buffer_lock_reserve() could cause ring buffer
      corruption if called when ring_buffer_write() is being used.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      985e871b
    • S
      ring-buffer: Allways do the trace_recursive checks · 6776221b
      Steven Rostedt (Red Hat) 提交于
      Currently the trace_recursive checks are only done if CONFIG_TRACING
      is enabled. That was because there use to be a dependency with tracing
      for the recursive checks (it used the task_struct trace recursive
      variable). But now it uses its own variable and there is no dependency.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      6776221b
    • S
      ring-buffer: Move recursive check to per_cpu descriptor · 58a09ec6
      Steven Rostedt (Red Hat) 提交于
      Instead of using a global per_cpu variable to perform the recursive
      checks into the ring buffer, use the already existing per_cpu descriptor
      that is part of the ring buffer itself.
      
      Not only does this simplify the code, it also allows for one ring buffer
      to be used within the guts of the use of another ring buffer. For example
      trace_printk() can now be used within the ring buffer to record changes
      done by an instance into the main ring buffer. The recursion checks
      will prevent the trace_printk() itself from causing recursive issues
      with the main ring buffer (it is just ignored), but the recursive
      checks wont prevent the trace_printk() from recording other ring buffers.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      58a09ec6
  11. 22 5月, 2015 1 次提交
    • S
      ring-buffer: Add unlikelys to make fast path the default · 3205f806
      Steven Rostedt (Red Hat) 提交于
      I was running the trace_event benchmark and noticed that the times
      to record a trace_event was all over the place. I looked at the assembly
      of the ring_buffer_lock_reserver() and saw this:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 1d                   jne    ffffffff8113c60d <ring_buffer_lock_reserve+0x2d>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
       +---- 74 12                   je     ffffffff8113c610 <ring_buffer_lock_reserve+0x30>
       |     65 ff 0d 5b e3 ec 7e    decl   %gs:0x7eece35b(%rip)        # a960 <__preempt_count>
       |     0f 84 85 00 00 00       je     ffffffff8113c690 <ring_buffer_lock_reserve+0xb0>
       |     31 c0                   xor    %eax,%eax
       |     5d                      pop    %rbp
       |     c3                      retq
       |     90                      nop
       +---> 65 44 8b 05 48 e3 ec    mov    %gs:0x7eece348(%rip),%r8d        # a960 <__preempt_count>
             7e
             41 81 e0 ff ff ff 7f    and    $0x7fffffff,%r8d
             b0 08                   mov    $0x8,%al
             65 8b 0d 58 36 ed 7e    mov    %gs:0x7eed3658(%rip),%ecx        # fc80 <current_context>
             41 f7 c0 00 ff 1f 00    test   $0x1fff00,%r8d
             74 1e                   je     ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 f7 c0 00 00 10 00    test   $0x100000,%r8d
             b0 01                   mov    $0x1,%al
             75 13                   jne    ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
             41 81 e0 00 00 0f 00    and    $0xf0000,%r8d
             49 83 f8 01             cmp    $0x1,%r8
             19 c0                   sbb    %eax,%eax
             83 e0 02                and    $0x2,%eax
             83 c0 02                add    $0x2,%eax
             85 c8                   test   %ecx,%eax
             75 ab                   jne    ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e>
             09 c8                   or     %ecx,%eax
             65 89 05 24 36 ed 7e    mov    %eax,%gs:0x7eed3624(%rip)        # fc80 <current_context>
      
      The arrow is the fast path.
      
      After adding the unlikely's, the fast path looks a bit better:
      
       <ring_buffer_lock_reserve>:
             31 c0                   xor    %eax,%eax
             48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
             01
             55                      push   %rbp
             48 89 e5                mov    %rsp,%rbp
             75 7b                   jne    ffffffff8113c66b <ring_buffer_lock_reserve+0x8b>
             65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
             8b 47 08                mov    0x8(%rdi),%eax
             85 c0                   test   %eax,%eax
             0f 85 9f 00 00 00       jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             65 8b 0d 57 e3 ec 7e    mov    %gs:0x7eece357(%rip),%ecx        # a960 <__preempt_count>
             81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
             b0 08                   mov    $0x8,%al
             65 8b 15 68 36 ed 7e    mov    %gs:0x7eed3668(%rip),%edx        # fc80 <current_context>
             f7 c1 00 ff 1f 00       test   $0x1fff00,%ecx
             75 50                   jne    ffffffff8113c670 <ring_buffer_lock_reserve+0x90>
             85 d0                   test   %edx,%eax
             75 7d                   jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
             09 d0                   or     %edx,%eax
             65 89 05 53 36 ed 7e    mov    %eax,%gs:0x7eed3653(%rip)        # fc80 <current_context>
             65 8b 05 fc da ec 7e    mov    %gs:0x7eecdafc(%rip),%eax        # a130 <cpu_number>
             89 c2                   mov    %eax,%edx
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3205f806
  12. 14 5月, 2015 1 次提交
  13. 31 3月, 2015 1 次提交
  14. 25 3月, 2015 1 次提交
    • S
      ring-buffer: Replace this_cpu_*() with __this_cpu_*() · 80a9b64e
      Steven Rostedt 提交于
      It has come to my attention that this_cpu_read/write are horrible on
      architectures other than x86. Worse yet, they actually disable
      preemption or interrupts! This caused some unexpected tracing results
      on ARM.
      
         101.356868: preempt_count_add <-ring_buffer_lock_reserve
         101.356870: preempt_count_sub <-ring_buffer_lock_reserve
      
      The ring_buffer_lock_reserve has recursion protection that requires
      accessing a per cpu variable. But since preempt_disable() is traced, it
      too got traced while accessing the variable that is suppose to prevent
      recursion like this.
      
      The generic version of this_cpu_read() and write() are:
      
       #define this_cpu_generic_read(pcp)					\
       ({	typeof(pcp) ret__;						\
      	preempt_disable();						\
      	ret__ = *this_cpu_ptr(&(pcp));					\
      	preempt_enable();						\
      	ret__;								\
       })
      
       #define this_cpu_generic_to_op(pcp, val, op)				\
       do {									\
      	unsigned long flags;						\
      	raw_local_irq_save(flags);					\
      	*__this_cpu_ptr(&(pcp)) op val;					\
      	raw_local_irq_restore(flags);					\
       } while (0)
      
      Which is unacceptable for locations that know they are within preempt
      disabled or interrupt disabled locations.
      
      Paul McKenney stated that __this_cpu_() versions produce much better code on
      other architectures than this_cpu_() does, if we know that the call is done in
      a preempt disabled location.
      
      I also changed the recursive_unlock() to use two local variables instead
      of accessing the per_cpu variable twice.
      
      Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home
      
      Cc: stable@vger.kernel.org
      Acked-by: NChristoph Lameter <cl@linux.com>
      Reported-by: NUwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
      Tested-by: NUwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      80a9b64e
  15. 11 2月, 2015 1 次提交
    • S
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
      Steven Rostedt (Red Hat) 提交于
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
      Code was added where if the page is not full, to just sleep again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
      events (99.9969% of events!!), is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa "tracing: Do not busy wait in buffer splice"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1e0d6714
  16. 23 1月, 2015 1 次提交
  17. 20 11月, 2014 1 次提交
  18. 11 11月, 2014 1 次提交
    • R
      tracing: Do not busy wait in buffer splice · e30f53aa
      Rabin Vincent 提交于
      On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
      lockup:
      
       # trace-cmd record -e raw_syscalls:* -F false
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
       ...
       Call Trace:
        [<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
        [<ffffffff81092e25>] wait_on_pipe+0x35/0x40
        [<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
        [<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
        [<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
        [<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
        [<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff8112415d>] do_splice_to+0x6d/0x90
        [<ffffffff81126971>] SyS_splice+0x7c1/0x800
        [<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
      
      The problem is this: tracing_buffers_splice_read() calls
      ring_buffer_wait() to wait for data in the ring buffers.  The buffers
      are not empty so ring_buffer_wait() returns immediately.  But
      tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
      meaning it only wants to read a full page.  When the full page is not
      available, tracing_buffers_splice_read() tries to wait again with
      ring_buffer_wait(), which again returns immediately, and so on.
      
      Fix this by adding a "full" argument to ring_buffer_wait() which will
      make ring_buffer_wait() wait until the writer has left the reader's
      page, i.e.  until full-page reads will succeed.
      
      Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
      
      Cc: stable@vger.kernel.org # 3.16+
      Fixes: b1169cc6 ("tracing: Remove mock up poll wait function")
      Signed-off-by: NRabin Vincent <rabin@rab.in>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      e30f53aa
  19. 03 10月, 2014 1 次提交
    • S
      ring-buffer: Fix infinite spin in reading buffer · 24607f11
      Steven Rostedt (Red Hat) 提交于
      Commit 651e22f2 "ring-buffer: Always reset iterator to reader page"
      fixed one bug but in the process caused another one. The reset is to
      update the header page, but that fix also changed the way the cached
      reads were updated. The cache reads are used to test if an iterator
      needs to be updated or not.
      
      A ring buffer iterator, when created, disables writes to the ring buffer
      but does not stop other readers or consuming reads from happening.
      Although all readers are synchronized via a lock, they are only
      synchronized when in the ring buffer functions. Those functions may
      be called by any number of readers. The iterator continues down when
      its not interrupted by a consuming reader. If a consuming read
      occurs, the iterator starts from the beginning of the buffer.
      
      The way the iterator sees that a consuming read has happened since
      its last read is by checking the reader "cache". The cache holds the
      last counts of the read and the reader page itself.
      
      Commit 651e22f2 changed what was saved by the cache_read when
      the rb_iter_reset() occurred, making the iterator never match the cache.
      Then if the iterator calls rb_iter_reset(), it will go into an
      infinite loop by checking if the cache doesn't match, doing the reset
      and retrying, just to see that the cache still doesn't match! Which
      should never happen as the reset is suppose to set the cache to the
      current value and there's locks that keep a consuming reader from
      having access to the data.
      
      Fixes: 651e22f2 "ring-buffer: Always reset iterator to reader page"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      24607f11
  20. 26 8月, 2014 1 次提交
    • J
      trace: Fix epoll hang when we race with new entries · 4ce97dbf
      Josef Bacik 提交于
      Epoll on trace_pipe can sometimes hang in a weird case.  If the ring buffer is
      empty when we set waiters_pending but an event shows up exactly at that moment
      we can miss being woken up by the ring buffers irq work.  Since
      ring_buffer_empty() is inherently racey we will sometimes think that the buffer
      is not empty.  So we don't get woken up and we don't think there are any events
      even though there were some ready when we added the watch, which makes us hang.
      This patch fixes this by making sure that we are actually on the wait list
      before we set waiters_pending, and add a memory barrier to make sure
      ring_buffer_empty() is going to be correct.
      
      Link: http://lkml.kernel.org/p/1408989581-23727-1-git-send-email-jbacik@fb.com
      
      Cc: stable@vger.kernel.org # 3.10+
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      4ce97dbf
  21. 07 8月, 2014 2 次提交
    • S
      ring-buffer: Always reset iterator to reader page · 651e22f2
      Steven Rostedt (Red Hat) 提交于
      When performing a consuming read, the ring buffer swaps out a
      page from the ring buffer with a empty page and this page that
      was swapped out becomes the new reader page. The reader page
      is owned by the reader and since it was swapped out of the ring
      buffer, writers do not have access to it (there's an exception
      to that rule, but it's out of scope for this commit).
      
      When reading the "trace" file, it is a non consuming read, which
      means that the data in the ring buffer will not be modified.
      When the trace file is opened, a ring buffer iterator is allocated
      and writes to the ring buffer are disabled, such that the iterator
      will not have issues iterating over the data.
      
      Although the ring buffer disabled writes, it does not disable other
      reads, or even consuming reads. If a consuming read happens, then
      the iterator is reset and starts reading from the beginning again.
      
      My tests would sometimes trigger this bug on my i386 box:
      
      WARNING: CPU: 0 PID: 5175 at kernel/trace/trace.c:1527 __trace_find_cmdline+0x66/0xaa()
      Modules linked in:
      CPU: 0 PID: 5175 Comm: grep Not tainted 3.16.0-rc3-test+ #8
      Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
       00000000 00000000 f09c9e1c c18796b3 c1b5d74c f09c9e4c c103a0e3 c1b5154b
       f09c9e78 00001437 c1b5d74c 000005f7 c10bd85a c10bd85a c1cac57c f09c9eb0
       ed0e0000 f09c9e64 c103a185 00000009 f09c9e5c c1b5154b f09c9e78 f09c9e80^M
      Call Trace:
       [<c18796b3>] dump_stack+0x4b/0x75
       [<c103a0e3>] warn_slowpath_common+0x7e/0x95
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c10bd85a>] ? __trace_find_cmdline+0x66/0xaa
       [<c103a185>] warn_slowpath_fmt+0x33/0x35
       [<c10bd85a>] __trace_find_cmdline+0x66/0xaa^M
       [<c10bed04>] trace_find_cmdline+0x40/0x64
       [<c10c3c16>] trace_print_context+0x27/0xec
       [<c10c4360>] ? trace_seq_printf+0x37/0x5b
       [<c10c0b15>] print_trace_line+0x319/0x39b
       [<c10ba3fb>] ? ring_buffer_read+0x47/0x50
       [<c10c13b1>] s_show+0x192/0x1ab
       [<c10bfd9a>] ? s_next+0x5a/0x7c
       [<c112e76e>] seq_read+0x267/0x34c
       [<c1115a25>] vfs_read+0x8c/0xef
       [<c112e507>] ? seq_lseek+0x154/0x154
       [<c1115ba2>] SyS_read+0x54/0x7f
       [<c188488e>] syscall_call+0x7/0xb
      ---[ end trace 3f507febd6b4cc83 ]---
      >>>> ##### CPU 1 buffer started ####
      
      Which was the __trace_find_cmdline() function complaining about the pid
      in the event record being negative.
      
      After adding more test cases, this would trigger more often. Strangely
      enough, it would never trigger on a single test, but instead would trigger
      only when running all the tests. I believe that was the case because it
      required one of the tests to be shutting down via delayed instances while
      a new test started up.
      
      After spending several days debugging this, I found that it was caused by
      the iterator becoming corrupted. Debugging further, I found out why
      the iterator became corrupted. It happened with the rb_iter_reset().
      
      As consuming reads may not read the full reader page, and only part
      of it, there's a "read" field to know where the last read took place.
      The iterator, must also start at the read position. In the rb_iter_reset()
      code, if the reader page was disconnected from the ring buffer, the iterator
      would start at the head page within the ring buffer (where writes still
      happen). But the mistake there was that it still used the "read" field
      to start the iterator on the head page, where it should always start
      at zero because readers never read from within the ring buffer where
      writes occur.
      
      I originally wrote a patch to have it set the iter->head to 0 instead
      of iter->head_page->read, but then I questioned why it wasn't always
      setting the iter to point to the reader page, as the reader page is
      still valid.  The list_empty(reader_page->list) just means that it was
      successful in swapping out. But the reader_page may still have data.
      
      There was a bug report a long time ago that was not reproducible that
      had something about trace_pipe (consuming read) not matching trace
      (iterator read). This may explain why that happened.
      
      Anyway, the correct answer to this bug is to always use the reader page
      an not reset the iterator to inside the writable ring buffer.
      
      Cc: stable@vger.kernel.org # 2.6.28+
      Fixes: d769041f "ring_buffer: implement new locking"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      651e22f2
    • S
      ring-buffer: Up rb_iter_peek() loop count to 3 · 021de3d9
      Steven Rostedt (Red Hat) 提交于
      After writting a test to try to trigger the bug that caused the
      ring buffer iterator to become corrupted, I hit another bug:
      
       WARNING: CPU: 1 PID: 5281 at kernel/trace/ring_buffer.c:3766 rb_iter_peek+0x113/0x238()
       Modules linked in: ipt_MASQUERADE sunrpc [...]
       CPU: 1 PID: 5281 Comm: grep Tainted: G        W     3.16.0-rc3-test+ #143
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
        0000000000000000 ffffffff81809a80 ffffffff81503fb0 0000000000000000
        ffffffff81040ca1 ffff8800796d6010 ffffffff810c138d ffff8800796d6010
        ffff880077438c80 ffff8800796d6010 ffff88007abbe600 0000000000000003
       Call Trace:
        [<ffffffff81503fb0>] ? dump_stack+0x4a/0x75
        [<ffffffff81040ca1>] ? warn_slowpath_common+0x7e/0x97
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c138d>] ? rb_iter_peek+0x113/0x238
        [<ffffffff810c14df>] ? ring_buffer_iter_peek+0x2d/0x5c
        [<ffffffff810c6f73>] ? tracing_iter_reset+0x6e/0x96
        [<ffffffff810c74a3>] ? s_start+0xd7/0x17b
        [<ffffffff8112b13e>] ? kmem_cache_alloc_trace+0xda/0xea
        [<ffffffff8114cf94>] ? seq_read+0x148/0x361
        [<ffffffff81132d98>] ? vfs_read+0x93/0xf1
        [<ffffffff81132f1b>] ? SyS_read+0x60/0x8e
        [<ffffffff8150bf9f>] ? tracesys+0xdd/0xe2
      
      Debugging this bug, which triggers when the rb_iter_peek() loops too
      many times (more than 2 times), I discovered there's a case that can
      cause that function to legitimately loop 3 times!
      
      rb_iter_peek() is different than rb_buffer_peek() as the rb_buffer_peek()
      only deals with the reader page (it's for consuming reads). The
      rb_iter_peek() is for traversing the buffer without consuming it, and as
      such, it can loop for one more reason. That is, if we hit the end of
      the reader page or any page, it will go to the next page and try again.
      
      That is, we have this:
      
       1. iter->head > iter->head_page->page->commit
          (rb_inc_iter() which moves the iter to the next page)
          try again
      
       2. event = rb_iter_head_event()
          event->type_len == RINGBUF_TYPE_TIME_EXTEND
          rb_advance_iter()
          try again
      
       3. read the event.
      
      But we never get to 3, because the count is greater than 2 and we
      cause the WARNING and return NULL.
      
      Up the counter to 3.
      
      Cc: stable@vger.kernel.org # 2.6.37+
      Fixes: 69d1b839 "ring-buffer: Bind time extend and data events together"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      021de3d9
  22. 24 7月, 2014 1 次提交
  23. 19 7月, 2014 1 次提交
  24. 16 7月, 2014 1 次提交
    • M
      ring-buffer: Fix polling on trace_pipe · 97b8ee84
      Martin Lau 提交于
      ring_buffer_poll_wait() should always put the poll_table to its wait_queue
      even there is immediate data available.  Otherwise, the following epoll and
      read sequence will eventually hang forever:
      
      1. Put some data to make the trace_pipe ring_buffer read ready first
      2. epoll_ctl(efd, EPOLL_CTL_ADD, trace_pipe_fd, ee)
      3. epoll_wait()
      4. read(trace_pipe_fd) till EAGAIN
      5. Add some more data to the trace_pipe ring_buffer
      6. epoll_wait() -> this epoll_wait() will block forever
      
      ~ During the epoll_ctl(efd, EPOLL_CTL_ADD,...) call in step 2,
        ring_buffer_poll_wait() returns immediately without adding poll_table,
        which has poll_table->_qproc pointing to ep_poll_callback(), to its
        wait_queue.
      ~ During the epoll_wait() call in step 3 and step 6,
        ring_buffer_poll_wait() cannot add ep_poll_callback() to its wait_queue
        because the poll_table->_qproc is NULL and it is how epoll works.
      ~ When there is new data available in step 6, ring_buffer does not know
        it has to call ep_poll_callback() because it is not in its wait queue.
        Hence, block forever.
      
      Other poll implementation seems to call poll_wait() unconditionally as the very
      first thing to do.  For example, tcp_poll() in tcp.c.
      
      Link: http://lkml.kernel.org/p/20140610060637.GA14045@devbig242.prn2.facebook.com
      
      Cc: stable@vger.kernel.org # 2.6.27
      Fixes: 2a2cc8f7 "ftrace: allow the event pipe to be polled"
      Reviewed-by: NChris Mason <clm@fb.com>
      Signed-off-by: NMartin Lau <kafai@fb.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      97b8ee84