Commits · 73a757e63114dfd765f1c5d1ff7e994f123d0234 · openeuler / raspberrypi-kernel

01 May, 2017 · 1 commit

ring-buffer: Return reader page back into existing ring buffer · 73a757e6

Committed by Steven Rostedt (VMware) on May 01, 2017

When the ring buffer is read for consuming, the read is optimized for splice:
a page is taken out of the ring buffer (zero copy) and sent to the reading
consumer. When the reader is finished with the page, it calls
ring_buffer_free_read_page(), which simply frees the page. The next time the
reader needs to get a page from the ring buffer, it must call
ring_buffer_alloc_read_page(), which allocates and initializes a reader page
that is swapped into the ring buffer in exchange for a newly filled page.

The problem is that there's no reason to actually free the page when it is
passed back to the ring buffer. The ring buffer can hold on to it and reuse it
for the next iteration. This removes nearly all interaction with the
page_alloc mechanism.

Using the trace-cmd utility to record all events (causing trace-cmd to
require reading lots of pages from the ring buffer, and calling
ring_buffer_alloc/free_read_page() several times), and also assigning a
stack trace trigger to the mm_page_alloc event, we can see how many times
the ring_buffer_alloc_read_page() needed to allocate a page for the ring
buffer.

Before this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  9968

After this change:

  # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1
  # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l
  4

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
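The reuse idea in the commit above can be sketched in userspace C. This is only an illustration under assumptions: `struct page_cache`, `cache_alloc_page()`, `cache_free_page()` and `simulate_reads()` are hypothetical names, `malloc()` stands in for the kernel page allocator, and the locking that the real ring buffer needs is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES 4096

/* Hypothetical one-slot cache; not the kernel's data structure. */
struct page_cache {
	void *free_page;        /* page handed back by the reader, or NULL */
	unsigned long allocs;   /* how often the allocator was really used */
};

/* Hand out the cached page if there is one, otherwise allocate. */
static void *cache_alloc_page(struct page_cache *pc)
{
	void *page = pc->free_page;

	if (page) {
		pc->free_page = NULL;
		return page;
	}
	pc->allocs++;
	return malloc(PAGE_BYTES);
}

/* Keep the returned page for the next read instead of freeing it. */
static void cache_free_page(struct page_cache *pc, void *page)
{
	assert(page != NULL);
	if (pc->free_page) {
		free(page);     /* slot already occupied: really free it */
		return;
	}
	pc->free_page = page;
}

/* Simulate `rounds` read iterations and report the real allocations. */
static unsigned long simulate_reads(unsigned long rounds)
{
	struct page_cache pc = { NULL, 0 };
	unsigned long i;

	for (i = 0; i < rounds; i++) {
		void *page = cache_alloc_page(&pc);

		if (!page)
			break;
		cache_free_page(&pc, page);
	}
	free(pc.free_page);
	return pc.allocs;
}
```

With a single cached slot, a reader that returns each page before asking for the next one touches the allocator only once, which mirrors the drop in allocation counts reported in the commit message.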

02 Mar, 2017 · 1 commit

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> · e6017571

Committed by Ingo Molnar on Feb 01, 2017

We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
will have to be picked up from other headers and .c files.

Create a trivial placeholder <linux/sched/clock.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

13 Dec, 2016 · 1 commit

tracing/rb: Init the CPU mask on allocation · 99e6f6e8

Committed by Sebastian Andrzej Siewior on Dec 07, 2016

Before commit b32614c0 ("tracing/rb: Convert to hotplug state
machine") the allocated cpumask was initialized to the mask of ONLINE or
POSSIBLE CPUs. After the CPU hotplug changes the buffer initialisation
moved to trace_rb_cpu_prepare() but I forgot to initially set the
cpumask to zero. This is done now.

Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de

Fixes: b32614c0 ("tracing/rb: Convert to hotplug state machine")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

07 Dec, 2016 · 1 commit

tracing/rb: Init the CPU mask on allocation · b18cc3de

Committed by Sebastian Andrzej Siewior on Dec 07, 2016

Before commit b32614c0 ("tracing/rb: Convert to hotplug state machine")
the allocated cpumask was initialized to the mask of online or possible
CPUs. After the CPU hotplug changes the buffer initialization moved to
trace_rb_cpu_prepare() but the cpumask is allocated with alloc_cpumask()
and therefore has random content. As a consequence the cpu buffers are not
initialized and a later access dereferences a NULL pointer.

Use zalloc_cpumask() instead so trace_rb_cpu_prepare() initializes the
buffers properly.

Fixes: b32614c0 ("tracing/rb: Convert to hotplug state machine")
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
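Why a zeroed mask matters can be shown with a small userspace sketch. Everything here is an assumption made for illustration: `struct ring`, `ring_alloc()`, `ring_cpu_prepare()` and `ring_free()` are hypothetical, a byte array replaces the cpumask, and `calloc()` plays the role of the zeroing allocator. The point is only that a prepare step which skips CPUs whose bit is already set needs the mask to start out empty.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 8

/* Hypothetical, heavily reduced stand-in for the ring buffer descriptor. */
struct ring {
	unsigned char *cpumask;     /* one byte per CPU keeps the sketch simple */
	int *buffers[NR_CPUS];      /* per-CPU buffers, NULL until prepared */
};

/* Allocate the descriptor with a zeroed mask. */
static struct ring *ring_alloc(void)
{
	struct ring *r = calloc(1, sizeof(*r));

	if (!r)
		return NULL;
	r->cpumask = calloc(NR_CPUS, sizeof(*r->cpumask));
	if (!r->cpumask) {
		free(r);
		return NULL;
	}
	return r;
}

/* Prepare step: a set entry means "already initialized", so the CPU is skipped. */
static int ring_cpu_prepare(struct ring *r, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (r->cpumask[cpu])
		return 0;
	r->buffers[cpu] = calloc(1, sizeof(int));
	if (!r->buffers[cpu])
		return -1;
	r->cpumask[cpu] = 1;
	return 0;
}

/* Release all per-CPU buffers and the descriptor itself. */
static void ring_free(struct ring *r)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		free(r->buffers[cpu]);
	free(r->cpumask);
	free(r);
}
```

If the mask held leftover bits instead of zeros, `ring_cpu_prepare()` would return early for those CPUs and leave their buffer pointers NULL, which is the failure the commit message describes.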

02 Dec, 2016 · 1 commit

tracing/rb: Convert to hotplug state machine · b32614c0

Committed by Sebastian Andrzej Siewior on Nov 27, 2016

Install the callbacks via the state machine. The notifier in struct
ring_buffer is replaced by the multi instance interface.  Upon
__ring_buffer_alloc() invocation, cpuhp_state_add_instance() will invoke
the trace_rb_cpu_prepare() on each CPU.

This callback may now fail. This means __ring_buffer_alloc() will fail and
clean up (like previously), and during a CPU up event this failure will not
allow the CPU to come up.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20161126231350.10321-7-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

24 Nov, 2016 · 5 commits

ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline · 38e11df1

Committed by Steven Rostedt (Red Hat) on Nov 23, 2016

Both rb_end_commit() and rb_set_commit_to_write() are in the fast path of
the ring buffer recording. Make sure they are always inlined.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Froce rb_update_write_stamp() to be inlined · babe3fce

Committed by Steven Rostedt (Red Hat) on Nov 23, 2016

The function rb_update_write_stamp() is in the hotpath of the ring buffer
recording. Make sure that it is inlined as well. There are not many places
that call it.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Force inline of hotpath helper functions · 2289d567

Committed by Steven Rostedt (Red Hat) on Nov 23, 2016

There are several small helper functions in ring_buffer.c that are used in the
hot path. For some reason, even though they are marked inline, gcc tends not
to honor it. Make sure these functions are always inlined.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
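Forcing a helper inline, as the commits in this group do, can be sketched with the GCC/Clang `always_inline` attribute. The macro name `force_inline` and the helpers `align4()` and `event_size()` are made up for this example, and the 4-byte header is an assumption; the kernel uses its own `__always_inline` macro on its own helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed spelling for the sketch; plain "inline" is only a hint to the compiler. */
#define force_inline inline __attribute__((always_inline))

/* Hypothetical hot-path helper: round a length up to a 4-byte boundary. */
static force_inline uint32_t align4(uint32_t len)
{
	return (len + 3u) & ~3u;
}

/* Caller in the "hot path"; with always_inline no call to align4() remains. */
static uint32_t event_size(uint32_t payload_len)
{
	assert(payload_len < UINT32_MAX - 8u);
	return align4(payload_len + 4u);    /* 4-byte header assumed here */
}
```

Whether the attribute pays off depends on the call sites; it is most useful for tiny helpers like this one, where the call overhead is comparable to the work done.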

ring-buffer: Always inline rb_event_data() · 929ddbf3

Committed by Steven Rostedt (Red Hat) on Nov 23, 2016

The rb_event_data() is the fast path of getting the ring buffer data from an
event. Externally, ring_buffer_event_data() is used to access this function.
But unfortunately, rb_event_data() is not inlined, so calling
ring_buffer_event_data() results in a second function call. Force
rb_event_data() to be inlined to lower the number of operations needed when
calling ring_buffer_event_data().

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Make rb_reserve_next_event() always inlined · fa7ffb39

Committed by Steven Rostedt (Red Hat) on Nov 23, 2016

The function rb_reserve_next_event() is called by two functions:
ring_buffer_lock_reserve() and ring_buffer_write(). This is in a very hot
path of the tracing code, and it is best that it is not a separate function
call. The two callers are basically wrappers for rb_reserve_next_event().
Removing the function calls can save execution time in the hotpath of tracing.

Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org
Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

14 May, 2016 · 1 commit

ring-buffer: Prevent overflow of size in ring_buffer_resize() · 59643d15

Committed by Steven Rostedt (Red Hat) on May 13, 2016

If the size passed to ring_buffer_resize() is greater than ULONG_MAX - BUF_PAGE_SIZE
then the DIV_ROUND_UP() will return zero.

Here are the details:

  # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb

tracing_entries_write() processes this and converts kb to bytes.

 18014398509481980 << 10 = 18446744073709547520

and this is passed to ring_buffer_resize() as unsigned long size.

 size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);

Where DIV_ROUND_UP(a, b) is (a + b - 1)/b

BUF_PAGE_SIZE is 4080 and here

 18446744073709547520 + 4080 - 1 = 18446744073709551599

where 18446744073709551599 is still smaller than 2^64

 2^64 - 18446744073709551599 = 17

But now 18446744073709551599 / 4080 = 4521260802379792

and size = size * 4080 = 18446744073709551360

This is checked to make sure it's still greater than 2 * 4080,
which it is.

Then we convert to the number of buffer pages needed.

 nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE)

but this time size is 18446744073709551360 and

 2^64 - (18446744073709551360 + 4080 - 1) = -3823

Thus it overflows and the resulting number is less than 4080, which makes

  3823 / 4080 = 0

and nr_pages is set to this. As we already checked against the minimum that
nr_pages may be, this causes the logic to fail as well, and we crash the
kernel.

There's no reason to have the two DIV_ROUND_UP() calls (that's just the result
of historical code changes). Clean up the code and fix this bug.

Cc: stable@vger.kernel.org # 3.5+
Fixes: 83f40318 ("ring-buffer: Make removal of ring buffer pages atomic")
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
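The arithmetic in the commit message above can be replayed in userspace C with 64-bit unsigned integers. The function names are invented for this sketch, and `uint64_t` is assumed to match `unsigned long` on the 64-bit machine from the report; the numbers are the ones quoted in the message.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_PAGE_SIZE 4080ULL
#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))

/* Old flow: round up to whole pages, convert back to bytes, round up again. */
static uint64_t nr_pages_double_rounding(uint64_t size)
{
	size = DIV_ROUND_UP(size, BUF_PAGE_SIZE);   /* number of pages */
	size *= BUF_PAGE_SIZE;                      /* back to bytes */
	return DIV_ROUND_UP(size, BUF_PAGE_SIZE);   /* the addition wraps past 2^64 */
}

/* Single rounding step. */
static uint64_t nr_pages_single_rounding(uint64_t size)
{
	return DIV_ROUND_UP(size, BUF_PAGE_SIZE);
}

/* Replay the worked values from the commit message. */
static void check_worked_example(void)
{
	uint64_t size = 18014398509481980ULL << 10;     /* kb to bytes */

	assert(size == 18446744073709547520ULL);
	assert(nr_pages_single_rounding(size) == 4521260802379792ULL);
	assert(nr_pages_double_rounding(size) == 0);
}
```

Note that a single `DIV_ROUND_UP()` can itself wrap for sizes within 4079 bytes of 2^64; this sketch does not guard against that case and only reproduces the failure described above.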

13 May, 2016 · 1 commit

ring-buffer: Use long for nr_pages to avoid overflow failures · 9b94a8fb

Committed by Steven Rostedt (Red Hat) on May 12, 2016

The size variable to change the ring buffer in ftrace is a long. The
nr_pages used to update the ring buffer based on the size is int. On 64 bit
machines this can cause an overflow problem.

For example, the following will cause the ring buffer to crash:

 # cd /sys/kernel/debug/tracing
 # echo 10 > buffer_size_kb
 # echo 8556384240 > buffer_size_kb

Then you get the warning of:

 WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260

Which is:

  RB_WARN_ON(cpu_buffer, nr_removed);

Note each ring buffer page holds 4080 bytes.

This is because:

 1) 10 causes the ring buffer to have 3 pages.
    (10kb requires 3 pages of 4080 bytes each to hold)

 2) (2^31 / 2^10  + 1) * 4080 = 8556384240
    The value written into buffer_size_kb is shifted by 10 and then passed
    to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760

 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE
    which is 4080. 8761737461760 / 4080 = 2147484672

 4) The current nr_pages (3) is subtracted from the new nr_pages and we get:
    2147484669. This value is saved in a signed integer nr_pages_to_update

 5) 2147484669 is greater than 2^31 but smaller than 2^32, so as a signed int
    it turns into the value of -2147482627

 6) As the value is a negative number, in update_pages_handler() it is
    negated and passed to rb_remove_pages() and 2147482627 pages will
    be removed, which is much larger than 3 and it causes the warning
    because not all the pages asked to be removed were removed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001

Cc: stable@vger.kernel.org # 2.6.28+
Fixes: 7a8e76a3 ("tracing: unified trace buffer")
Reported-by: Hao Qin <QEver.cn@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
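Steps 2 to 5 of the commit message above can be checked with a short C sketch. The helpers `pages_to_update()` and `as_int32()` are hypothetical; the second one emulates storing the value in a 32-bit two's-complement int without relying on implementation-defined conversions. Plain division is used instead of rounding up because the quoted size divides evenly by 4080.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_PAGE_SIZE 4080

/* Reinterpret the low 32 bits of v as a two's-complement signed int. */
static int64_t as_int32(int64_t v)
{
	uint32_t low = (uint32_t)v;     /* reduction modulo 2^32 is well defined */

	if (low >= 0x80000000u)
		return (int64_t)low - 4294967296LL;
	return (int64_t)low;
}

/* Pages to add (positive) or remove (negative), kept in 64 bits. */
static int64_t pages_to_update(int64_t size_kb, int64_t cur_pages)
{
	int64_t size;

	assert(size_kb >= 0 && size_kb < (INT64_MAX >> 10));
	size = size_kb * 1024;          /* kb to bytes */
	return size / BUF_PAGE_SIZE - cur_pages;
}
```

Keeping the difference in a 64-bit variable preserves the positive value 2147484669, while squeezing it into 32 bits yields -2147482627, which the resize logic then misreads as a request to remove pages.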

26 Nov, 2015 · 1 commit

ring-buffer: Process commits whenever moving to a new page. · 4239c38f

Committed by Steven Rostedt (Red Hat) on Nov 17, 2015

When crossing over to a new page, commit the current work. This will allow
readers to get data with less latency, and also simplifies the work to get
timestamps working for interrupted events.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

24 Nov, 2015 · 4 commits

ring-buffer: Remove redundant update of page timestamp · 70004986

Committed by Steven Rostedt (Red Hat) on Nov 17, 2015

The first commit of a buffer page updates the timestamp of that page. No
need to have the update to the next page add the timestamp too. It will only
be replaced by the first commit on that page anyway.

Only update the timestamp of a page if it contains an event.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Use READ_ONCE() for most tail_page access · 8573636e

Committed by Steven Rostedt (Red Hat) on Nov 17, 2015

As cpu_buffer->tail_page may be modified by interrupts at almost any time,
the flow of logic is very important. Prevent gcc from getting smart and
re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its
accesses.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
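The single-snapshot pattern behind READ_ONCE() can be sketched as follows. The macro below is a simplified stand-in that relies on the GNU `__typeof__` extension, not the kernel's definition, and the structures are reduced to one hypothetical field each.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: the volatile access forces exactly one load. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct buffer_page {
	unsigned long write;
};

/* Hypothetical, reduced per-CPU descriptor. */
struct cpu_buffer {
	struct buffer_page *tail_page;  /* may be changed by an interrupt */
};

/* Take one snapshot of tail_page and use only that snapshot afterwards. */
static unsigned long tail_write(struct cpu_buffer *cpu_buffer)
{
	struct buffer_page *tail_page = READ_ONCE(cpu_buffer->tail_page);

	assert(tail_page != NULL);
	return tail_page->write;
}
```

Without the volatile access, the compiler would be free to load `cpu_buffer->tail_page` again later in the function, and the two loads could observe different pages if an interrupt moved the tail in between.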

ring-buffer: Put back the length if crossed page with add_timestamp · bd1b7cd3

Committed by Steven Rostedt (Red Hat) on Nov 23, 2015

Commit fcc742ea "ring-buffer: Add event descriptor to simplify passing
data" added a descriptor that holds various data instead of passing around
several variables through parameters. The problem was that one of the
parameters was modified in a function, and the code was designed so that the
modification had no effect outside that function. Now that the parameter is
part of a descriptor, any modification to it persists, and the size of the
data could be unnecessarily expanded.

Remove the extra space added if a timestamp was added and the event went
across the page.

Cc: stable@vger.kernel.org # 4.3+
Fixes: fcc742ea "ring-buffer: Add event descriptor to simplify passing data"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Update read stamp with first real commit on page · b81f472a

Committed by Steven Rostedt (Red Hat) on Nov 23, 2015

Do not update the read stamp after swapping out the reader page from the
write buffer. If the reader page is swapped out of the buffer before an
event is written to it, then the read_stamp may get an out of date
timestamp, as the page timestamp is updated on the first commit to that
page.

rb_get_reader_page() only returns a page if it has an event on it, otherwise
it will return NULL. At that point, check if the page being returned has
events and has not been read yet. If so, update the read_stamp to match the
time stamp of the reader page.

Cc: stable@vger.kernel.org # 2.6.30+
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

03 Nov, 2015 · 4 commits

ring-buffer: rb_event_is_commit() can return boolean · cdb2a0a9

Committed by Yaowei Bai on Sep 29, 2015

Make rb_event_is_commit() return bool to improve readability
due to this particular function only using either one or zero as its
return value.

No functional change.

Link: http://lkml.kernel.org/r/1443537816-5788-7-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: rb_per_cpu_empty() can return boolean · da58834c

Committed by Yaowei Bai on Sep 29, 2015

Make rb_per_cpu_empty() return bool to improve readability.

No functional change.

Link: http://lkml.kernel.org/r/1443537816-5788-6-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring_buffer: ring_buffer_empty{cpu}() can return boolean · 3d4e204d

Committed by Yaowei Bai on Sep 29, 2015

Make ring_buffer_empty() and ring_buffer_empty_cpu() return bool.

No functional change.

Link: http://lkml.kernel.org/r/1443537816-5788-5-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: rb_is_reader_page() can return boolean · 06ca3209

Committed by Yaowei Bai on Sep 29, 2015

Make rb_is_reader_page() return bool to improve readability due to this
particular function only using either true or false as its return value.

No functional change.

Link: http://lkml.kernel.org/r/1443537816-5788-4-git-send-email-bywxiaobai@163.com
Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

03 Sep, 2015 · 1 commit

ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated" · b7dc42fd

Committed by Steven Rostedt (Red Hat) on Sep 03, 2015

The commit a4543a2f "ring-buffer: Get timestamp after event is
allocated" is needed for some future work. But after adding it, there is a
race somewhere that causes the saved timestamp to have a slight shift, and
get ahead of the actual timestamp and make it look like time goes backwards.

I'm still looking into why this happens, but in the meantime, this is
holding up other work from getting in. I'm reverting the change for now (which
makes the problem go away), and will add it back after I know what is wrong
and fix it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

21 Jul, 2015 · 5 commits

ring-buffer: Reorganize function locations · d90fd774

Committed by Steven Rostedt (Red Hat) on May 29, 2015

Functions in ring_buffer.c have gotten interleaved between different
use cases. Move the functions around to get like functions closer
together. This may or may not help gcc keep cache locality, but it
makes it a little easier to work with the code.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Make sure event has enough room for extend and padding · 7d75e683

Committed by Steven Rostedt (Red Hat) on May 29, 2015

Now that events only add time extends after it is committed, in case
an event comes in before it can discard the allocated event, the time
extend needs to be stored within the event. If the event is bigger
than the size needed for the time extend, padding must be added.
The minimum padding size is 8 bytes. Thus if the event is 12 bytes
(size of time extend + 4), there will not be enough room to add both
the time extend and padding. Make sure all events are either 8 bytes
or 16 or more bytes.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Get timestamp after event is allocated · a4543a2f

Committed by Steven Rostedt (Red Hat) on May 29, 2015

Move the capturing of the timestamp to after an event is allocated.
If the event is not a commit (where it is an event that preempted
another event), then no timestamp is needed, because the delta of
nested events is always zero.

If the event starts on a new page, no delta needs to be calculated
as the full timestamp will be added to the page header, and the
event will have a delta of zero.

Now if the event requires a time extend (the delta does not fit
in the 27 bit delta slot in the header), then the event is discarded,
the length is extended to hold the TIME_EXTEND event that allows for
a 59 bit delta, and the commit is tried again.

If the event can't be discarded (another event came in after it),
then the TIME_EXTEND is added directly to the allocated event and
the rest of the event is given padding.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
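The 27-bit and 59-bit figures in the commit above fit together as 27 + 32 bits. A minimal sketch, assuming the extended delta is stored as the low 27 bits plus a separate 32-bit word: `struct time_extend`, `delta_fits()`, `split_delta()` and `join_delta()` are names chosen for this example and do not describe the kernel's event layout.

```c
#include <assert.h>
#include <stdint.h>

#define TS_SHIFT 27
#define TS_MASK  ((1ULL << TS_SHIFT) - 1)

/* Does the delta fit in the 27-bit slot of a normal event header? */
static int delta_fits(uint64_t delta)
{
	return (delta & ~TS_MASK) == 0;
}

/* Hypothetical container for an extended delta (27 + 32 = 59 bits). */
struct time_extend {
	uint32_t low27;
	uint32_t high32;
};

/* Split a delta of up to 59 bits into its two parts. */
static struct time_extend split_delta(uint64_t delta)
{
	struct time_extend te;

	assert((delta >> 59) == 0);
	te.low27 = (uint32_t)(delta & TS_MASK);
	te.high32 = (uint32_t)(delta >> TS_SHIFT);
	return te;
}

/* Reassemble the original delta. */
static uint64_t join_delta(struct time_extend te)
{
	return ((uint64_t)te.high32 << TS_SHIFT) | te.low27;
}
```

A writer would call `delta_fits()` first and only fall back to the extended form when it returns 0, which keeps the common case at a single header word.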

ring-buffer: Move the adding of the extended timestamp out of line · 9826b273

Committed by Steven Rostedt (Red Hat) on May 28, 2015

Requiring an extended time stamp is an uncommon occurrence, and it is
best to do it out of line when needed.

Add a noinline function that handles the extended timestamp and
have it called with an unlikely to completely move it out of the
fast path.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Add event descriptor to simplify passing data · fcc742ea

Committed by Steven Rostedt (Red Hat) on May 28, 2015

Add an rb_event_info descriptor to make passing event info to functions a bit
easier than using a bunch of parameters. This will also allow for
changing the code around a bit to find better fast paths.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

29 May, 2015 · 3 commits

ring-buffer: Add enum names for the context levels · a497adb4

Committed by Steven Rostedt (Red Hat) on May 29, 2015

Instead of having hard coded numbers for the context levels, use
enums to describe them better.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
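Named context levels make a per-context recursion check easier to read than bare bit numbers. The following is a sketch under assumptions: the enum names, their order, and the `recursive_lock()` / `recursive_unlock()` helpers are illustrative and are not taken from the kernel source.

```c
#include <assert.h>

/* Named context levels; names and order are assumptions for this sketch. */
enum ctx_level {
	CTX_NMI,
	CTX_IRQ,
	CTX_SOFTIRQ,
	CTX_NORMAL,
	CTX_MAX
};

/* Returns 0 and marks the level, or -1 if that level is already active. */
static int recursive_lock(unsigned int *current_context, enum ctx_level level)
{
	unsigned int bit;

	assert(level < CTX_MAX);
	bit = 1u << level;
	if (*current_context & bit)
		return -1;              /* recursion within the same context */
	*current_context |= bit;
	return 0;
}

/* Clear the mark for the given level. */
static void recursive_unlock(unsigned int *current_context, enum ctx_level level)
{
	assert(level < CTX_MAX);
	*current_context &= ~(1u << level);
}
```

An interrupt that arrives while normal context holds its bit can still take its own bit, but a second entry at the same level is rejected.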

ring-buffer: Remove useless unused tracing_off_permanent() · 3c6296f7

Committed by Steven Rostedt (Red Hat) on May 28, 2015

The tracing_off_permanent() call is a way to disable all ring_buffers.
Nothing uses it and nothing should use it, as tracing_off() and
friends are better, as they disable the ring buffers related to
tracing. The tracing_off_permanent() even disabled non tracing
ring buffers. This is a bit drastic, and was added to handle NMIs
doing outputs that could corrupt the ring buffer when only tracing
used them. It is now obsolete and adds a little overhead, so it should
be removed.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Give NMIs a chance to lock the reader_lock · 289a5a25

Committed by Steven Rostedt (Red Hat) on May 28, 2015

Currently, if an NMI does a dump of a ring buffer, it disables
all ring buffers from ever doing any writes again. This is because
it won't take the locks for the cpu_buffer and this can cause
corruption if it preempted a read, or a read happens on another
CPU for the current cpu buffer. This is a bit overkill.

First, it should at least try to take the lock, and if it fails
then disable it. Also, there's no need to disable all ring
buffers, even those that are unrelated to what is being read.
Only disable the per cpu ring buffer that is being read if
it cannot get the lock for it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
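The "try the lock, and only on failure disable this one buffer" idea can be sketched with a C11 `atomic_flag` standing in for the real spinlock. `struct cpu_buf` and `nmi_try_dump()` are hypothetical, and what the real code does after a failed attempt is simplified here to an early return.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU buffer. */
struct cpu_buf {
	atomic_flag reader_lock;    /* stand-in for the real spinlock */
	bool record_disabled;
};

/*
 * Dump path for a context that must not spin: try the lock once, and
 * if it is held elsewhere, disable recording on this buffer only.
 */
static bool nmi_try_dump(struct cpu_buf *b)
{
	assert(b != NULL);
	if (atomic_flag_test_and_set(&b->reader_lock)) {
		b->record_disabled = true;
		return false;
	}
	/* ... the buffer would be read here ... */
	atomic_flag_clear(&b->reader_lock);
	return true;
}
```

The design choice is one of scope: a failed attempt affects only the buffer whose lock could not be taken, and a successful attempt leaves recording untouched.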

27 May, 2015 · 3 commits

ring-buffer: Add trace_recursive checks to ring_buffer_write() · 985e871b

Committed by Steven Rostedt (Red Hat) on May 27, 2015

The ring_buffer_write() function isn't protected by the trace recursive
checks. Luckily, this function is not used as much and is unlikely
to ever recurse. But it should still have the protection, because
even a call to ring_buffer_lock_reserve() could cause ring buffer
corruption if called when ring_buffer_write() is being used.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Allways do the trace_recursive checks · 6776221b

Committed by Steven Rostedt (Red Hat) on May 27, 2015

Currently the trace_recursive checks are only done if CONFIG_TRACING
is enabled. That was because there used to be a dependency with tracing
for the recursive checks (it used the task_struct trace recursive
variable). But now it uses its own variable and there is no dependency.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Move recursive check to per_cpu descriptor · 58a09ec6

Committed by Steven Rostedt (Red Hat) on May 27, 2015

Instead of using a global per_cpu variable to perform the recursive
checks into the ring buffer, use the already existing per_cpu descriptor
that is part of the ring buffer itself.

Not only does this simplify the code, it also allows for one ring buffer
to be used within the guts of the use of another ring buffer. For example
trace_printk() can now be used within the ring buffer to record changes
done by an instance into the main ring buffer. The recursion checks
will prevent the trace_printk() itself from causing recursive issues
with the main ring buffer (it is just ignored), but the recursive
checks won't prevent the trace_printk() from recording into other ring buffers.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

22 May, 2015 · 1 commit

ring-buffer: Add unlikelys to make fast path the default · 3205f806

Committed by Steven Rostedt (Red Hat) on May 21, 2015

I was running the trace_event benchmark and noticed that the times
to record a trace_event were all over the place. I looked at the assembly
of ring_buffer_lock_reserve() and saw this:

 <ring_buffer_lock_reserve>:
       31 c0                   xor    %eax,%eax
       48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
       01
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       75 1d                   jne    ffffffff8113c60d <ring_buffer_lock_reserve+0x2d>
       65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
       8b 47 08                mov    0x8(%rdi),%eax
       85 c0                   test   %eax,%eax
 +---- 74 12                   je     ffffffff8113c610 <ring_buffer_lock_reserve+0x30>
 |     65 ff 0d 5b e3 ec 7e    decl   %gs:0x7eece35b(%rip)        # a960 <__preempt_count>
 |     0f 84 85 00 00 00       je     ffffffff8113c690 <ring_buffer_lock_reserve+0xb0>
 |     31 c0                   xor    %eax,%eax
 |     5d                      pop    %rbp
 |     c3                      retq
 |     90                      nop
 +---> 65 44 8b 05 48 e3 ec    mov    %gs:0x7eece348(%rip),%r8d        # a960 <__preempt_count>
       7e
       41 81 e0 ff ff ff 7f    and    $0x7fffffff,%r8d
       b0 08                   mov    $0x8,%al
       65 8b 0d 58 36 ed 7e    mov    %gs:0x7eed3658(%rip),%ecx        # fc80 <current_context>
       41 f7 c0 00 ff 1f 00    test   $0x1fff00,%r8d
       74 1e                   je     ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
       41 f7 c0 00 00 10 00    test   $0x100000,%r8d
       b0 01                   mov    $0x1,%al
       75 13                   jne    ffffffff8113c64f <ring_buffer_lock_reserve+0x6f>
       41 81 e0 00 00 0f 00    and    $0xf0000,%r8d
       49 83 f8 01             cmp    $0x1,%r8
       19 c0                   sbb    %eax,%eax
       83 e0 02                and    $0x2,%eax
       83 c0 02                add    $0x2,%eax
       85 c8                   test   %ecx,%eax
       75 ab                   jne    ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e>
       09 c8                   or     %ecx,%eax
       65 89 05 24 36 ed 7e    mov    %eax,%gs:0x7eed3624(%rip)        # fc80 <current_context>

The arrow is the fast path.

After adding the unlikely's, the fast path looks a bit better:

 <ring_buffer_lock_reserve>:
       31 c0                   xor    %eax,%eax
       48 83 3d 76 47 bd 00    cmpq   $0x1,0xbd4776(%rip)        # ffffffff81d10d60 <ring_buffer_flags>
       01
       55                      push   %rbp
       48 89 e5                mov    %rsp,%rbp
       75 7b                   jne    ffffffff8113c66b <ring_buffer_lock_reserve+0x8b>
       65 ff 05 69 e3 ec 7e    incl   %gs:0x7eece369(%rip)        # a960 <__preempt_count>
       8b 47 08                mov    0x8(%rdi),%eax
       85 c0                   test   %eax,%eax
       0f 85 9f 00 00 00       jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
       65 8b 0d 57 e3 ec 7e    mov    %gs:0x7eece357(%rip),%ecx        # a960 <__preempt_count>
       81 e1 ff ff ff 7f       and    $0x7fffffff,%ecx
       b0 08                   mov    $0x8,%al
       65 8b 15 68 36 ed 7e    mov    %gs:0x7eed3668(%rip),%edx        # fc80 <current_context>
       f7 c1 00 ff 1f 00       test   $0x1fff00,%ecx
       75 50                   jne    ffffffff8113c670 <ring_buffer_lock_reserve+0x90>
       85 d0                   test   %edx,%eax
       75 7d                   jne    ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1>
       09 d0                   or     %edx,%eax
       65 89 05 53 36 ed 7e    mov    %eax,%gs:0x7eed3653(%rip)        # fc80 <current_context>
       65 8b 05 fc da ec 7e    mov    %gs:0x7eecdafc(%rip),%eax        # a130 <cpu_number>
       89 c2                   mov    %eax,%edx

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
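Branch hints of the kind this commit adds are built on the GCC/Clang `__builtin_expect()` builtin. In the sketch below the `likely()`/`unlikely()` macros follow the usual definition, while `struct rb` and `reserve()` are invented for illustration and do not mirror ring_buffer_lock_reserve().

```c
#include <assert.h>
#include <stddef.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical buffer with a disable flag and a byte counter. */
struct rb {
	int record_disabled;
	int reserved;
};

/* Error checks are marked unlikely so the success path can fall straight through. */
static int reserve(struct rb *buffer, int length)
{
	assert(buffer != NULL);
	if (unlikely(buffer->record_disabled))
		return -1;
	if (unlikely(length <= 0))
		return -1;
	buffer->reserved += length;
	return 0;
}
```

The hints do not change what the function returns; they only tell the compiler which side of each branch to lay out as the straight-line path, which is the difference visible in the two disassembly listings above.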

14 May, 2015 · 1 commit

tracing: Rename ftrace_event.h to trace_events.h · af658dca

Committed by Steven Rostedt (Red Hat) on Apr 29, 2015

The term "ftrace" is really the infrastructure of the function hooks,
and not the trace events. Rename ftrace_event.h to trace_events.h to
represent the trace_event infrastructure and decouple the term ftrace
from it.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

31 Mar, 2015 · 1 commit

ring-buffer: Remove duplicate use of '&' in recursive code · d631c8cc

Committed by Steven Rostedt (Red Hat) on Mar 27, 2015

A clean up of the recursive protection code changed

  val = this_cpu_read(current_context);
  val--;
  val &= this_cpu_read(current_context);

to

  val = this_cpu_read(current_context);
  val &= val & (val - 1);

Which has a duplicate use of '&' as the above is the same as

  val = val & (val - 1);

Actually, it would be best to remove that line altogether and
just add it to where it is used.

And Christoph even mentioned that it can be further compacted to
just a single line:

  __this_cpu_and(current_context, __this_cpu_read(current_context) - 1);

Link: http://lkml.kernel.org/alpine.DEB.2.11.1503271423580.23114@gentwo.org
Suggested-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
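That the three forms quoted in the commit above compute the same value, namely the input with its lowest set bit cleared, can be checked with plain unsigned integers. The function names are made up for this sketch and an ordinary variable replaces the per-CPU `current_context`.

```c
#include <assert.h>

/* Original three-step form: decrement a copy, then AND with the value. */
static unsigned int unlock_v1(unsigned int ctx)
{
	unsigned int val = ctx;

	val--;
	val &= ctx;
	return val;
}

/* Form with the duplicated '&'. */
static unsigned int unlock_v2(unsigned int ctx)
{
	unsigned int val = ctx;

	val &= val & (val - 1);
	return val;
}

/* Compact form: clear the lowest set bit. */
static unsigned int unlock_v3(unsigned int ctx)
{
	return ctx & (ctx - 1);
}

/* Compare all three forms over every 4-bit context value. */
static void check_forms_agree(void)
{
	unsigned int ctx;

	for (ctx = 0; ctx < 16; ctx++) {
		assert(unlock_v1(ctx) == unlock_v3(ctx));
		assert(unlock_v2(ctx) == unlock_v3(ctx));
	}
}
```

For example, a context value of binary 1010 becomes 1000: the most recently entered (lowest) level is released while the outer level stays marked.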

25 Mar, 2015 · 1 commit

ring-buffer: Replace this_cpu_*() with __this_cpu_*() · 80a9b64e

Committed by Steven Rostedt on Mar 17, 2015

It has come to my attention that this_cpu_read/write are horrible on
architectures other than x86. Worse yet, they actually disable
preemption or interrupts! This caused some unexpected tracing results
on ARM.

   101.356868: preempt_count_add <-ring_buffer_lock_reserve
   101.356870: preempt_count_sub <-ring_buffer_lock_reserve

The ring_buffer_lock_reserve has recursion protection that requires
accessing a per cpu variable. But since preempt_disable() is traced, it
too got traced while accessing the variable that is supposed to prevent
recursion like this.

The generic versions of this_cpu_read() and this_cpu_write() are:

 #define this_cpu_generic_read(pcp)					\
 ({	typeof(pcp) ret__;						\
	preempt_disable();						\
	ret__ = *this_cpu_ptr(&(pcp));					\
	preempt_enable();						\
	ret__;								\
 })

 #define this_cpu_generic_to_op(pcp, val, op)				\
 do {									\
	unsigned long flags;						\
	raw_local_irq_save(flags);					\
	*__this_cpu_ptr(&(pcp)) op val;					\
	raw_local_irq_restore(flags);					\
 } while (0)

Which is unacceptable for locations that know they are within preempt
disabled or interrupt disabled locations.

Paul McKenney stated that the __this_cpu_*() versions produce much better code on
other architectures than this_cpu_*() does, if we know that the call is done in
a preempt disabled location.

I also changed the recursive_unlock() to use two local variables instead
of accessing the per_cpu variable twice.

Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home

Cc: stable@vger.kernel.org
Acked-by: Christoph Lameter <cl@linux.com>
Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

11 Feb, 2015 · 1 commit

ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714

Committed by Steven Rostedt (Red Hat) on Feb 10, 2015

When an application connects to the ring buffer via splice, it can only
read full pages. Splice does not work with partial pages. If there is
not enough data to fill a page, the splice command will either block
or return -EAGAIN (if set to nonblock).

Code was added to just sleep again if the page is not full.
The problem is, it will get woken up again on the next event. That
is, when something is written into the ring buffer, if there is a waiter
it will wake it up. The waiter would then check the buffer, see that
it still does not have enough data to fill a page and go back to sleep.
To make matters worse, when the waiter goes back to sleep, it could
cause another event, which would wake it back up again to see it
doesn't have enough data and sleep again. This produces a tremendous
overhead and fills the ring buffer with noise.

For example, recording sched_switch on an idle system for 10 seconds
produces 25,350,475 events!!!

Create another wait queue for those waiters wanting full pages.
When an event is written, it only wakes up waiters if there's a full
page of data. It does not wake up the waiter if the page is not yet
full.

After this change, recording sched_switch on an idle system for 10
seconds produces only 800 events. Getting rid of 25,349,675 useless
events (99.9968% of events!!) is something to take seriously.

Cc: stable@vger.kernel.org # 3.16+
Cc: Rabin Vincent <rabin@rab.in>
Fixes: e30f53aa "tracing: Do not busy wait in buffer splice"
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
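The effect of waking a full-page waiter only when a page fills can be illustrated with a small counting model. This is a toy simulation, not the kernel's wait queue code: `EVENTS_PER_PAGE` is a made-up figure and `count_wakeups()` is a hypothetical helper.

```c
#include <assert.h>

#define EVENTS_PER_PAGE 100     /* made-up figure for the sketch */

/*
 * Count wake-ups of a waiter that wants full pages over `events` writes.
 * wake_on_every_event == 1 models waking on each write; 0 models waking
 * only when a page has just been filled.
 */
static unsigned long count_wakeups(unsigned long events, int wake_on_every_event)
{
	unsigned long in_page = 0, wakeups = 0, i;

	assert(wake_on_every_event == 0 || wake_on_every_event == 1);
	for (i = 0; i < events; i++) {
		in_page++;
		if (wake_on_every_event || in_page == EVENTS_PER_PAGE)
			wakeups++;
		if (in_page == EVENTS_PER_PAGE)
			in_page = 0;    /* start filling the next page */
	}
	return wakeups;
}
```

In the model, 1000 writes cause 1000 wake-ups in the first mode and 10 in the second. The model leaves out the feedback loop from the commit message, where each pointless wake-up generates further events.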

23 Jan, 2015 · 1 commit

tracing: Remove unneeded includes of debugfs.h and fs.h · 3efb5f21

Committed by Steven Rostedt (Red Hat) on Jan 20, 2015

The creation of tracing files and directories is for the most part
encapsulated in helper functions in trace.c. Other files do not need to
include debugfs.h or fs.h, as they may have needed to in the past.

Remove them from the files that do not need them.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

20 Nov, 2014 · 1 commit

ring-buffer: Remove check of trace_seq_{puts,printf}() return values · c0cd93aa

Committed by Steven Rostedt (Red Hat) on Nov 12, 2014

Remove checking the return value of all trace_seq_puts(). It was wrong
anyway as only the last return value mattered. But as the trace_seq_puts()
is going to be a void function in the future, we should not be checking
the return value of it anyway.

Just return !trace_seq_has_overflowed() instead.

Reviewed-by: Petr Mladek <pmladek@suse.cz>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>