提交 · a8941b0ed0f1e39a4d41560c3a2e7ee37d5b6e10 · openanolis / cloud-kernel

03 4月, 2010 4 次提交

kgdb: Turn off tracing while in the debugger · 4da75b9c

由 Jason Wessel 提交于 4月 02, 2010

The kernel debugger should turn off kernel tracing any time the
debugger is active and restore it on resume.
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>

4da75b9c

kgdb: use atomic_inc and atomic_dec instead of atomic_set · ae6bf53e

由 Jason Wessel 提交于 4月 02, 2010

Memory barriers should be used for the kgdb cpu synchronization.  The
atomic_set() does not imply a memory barrier.
Reported-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

ae6bf53e

kgdb: eliminate kgdb_wait(), all cpus enter the same way · 62fae312

由 Jason Wessel 提交于 4月 02, 2010

This is a kgdb architectural change to have all the cpus (master or
slave) enter the same function.

A cpu that hits an exception (wants to be the master cpu) will call
kgdb_handle_exception() from the trap handler and then invoke a
kgdb_roundup_cpu() to synchronize the other cpus and bring them into
the kgdb_handle_exception() as well.

A slave cpu will enter kgdb_handle_exception() from the
kgdb_nmicallback() and set the exception state to note that the
processor is a slave.

Previously the salve cpu would have called kgdb_wait().  This change
allows the debug core to change cpus without resuming the system in
order to inspect arch specific cpu information.
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

62fae312

kgdb: have ebin2mem call probe_kernel_write once · a0279bd5

由 Jason Wessel 提交于 4月 02, 2010

Rather than call probe_kernel_write() one byte at a time, process the
whole buffer locally and pass the entire result in one go.  This way,
architectures that need to do special handling based on the length can
do so, or we only end up calling memcpy() once.

[sonic.zhang@analog.com: Reported original problem and preliminary patch]
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
Signed-off-by: NSonic Zhang <sonic.zhang@analog.com>
Signed-off-by: NMike Frysinger <vapier@gentoo.org>

a0279bd5

30 3月, 2010 6 次提交

CRED: Fix memory leak in error handling · 570b8fb5

由 Mathieu Desnoyers 提交于 3月 30, 2010

Fix a memory leak on an OOM condition in prepare_usermodehelper_creds().
Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

570b8fb5

ring-buffer: Add missing unlock · 292f60c0

由 Julia Lawall 提交于 3月 29, 2010

In some error handling cases the lock is not unlocked.  The return is
converted to a goto, to share the unlock at the end of the function.

A simplified version of the semantic patch that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
expression E1;
identifier f;
@@

f (...) { <+...
* spin_lock_irq (E1,...);
... when != E1
* return ...;
...+> }
// </smpl>
Signed-off-by: NJulia Lawall <julia@diku.dk>
LKML-Reference: <Pine.LNX.4.64.1003291736440.21896@ask.diku.dk>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

292f60c0

tracing: Fix lockdep warning in global_clock() · e36673ec

由 Li Zefan 提交于 3月 24, 2010

# echo 1 > events/enable
 # echo global > trace_clock

------------[ cut here ]------------
WARNING: at kernel/lockdep.c:3162 check_flags+0xb2/0x190()
...
---[ end trace 3f86734a89416623 ]---
possible reason: unannotated irqs-on.
...

There's no reason to use the raw_local_irq_save() in trace_clock_global.
The local_irq_save() version is fine, and does not cause the bug in lockdep.
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BA97FA1.7030606@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

e36673ec

x86: Do not free zero sized per cpu areas · eed63519

由 Ian Campbell 提交于 3月 28, 2010

This avoids an infinite loop in free_early_partial().

Add a warning to free_early_partial() to catch future problems.

-v5: put back start > end back into WARN_ONCE()
-v6: use one line for warning, suggested by Linus
-v7: more tests
-v8: remove the function name as suggested by Johannes
     WARN_ONCE() will print out that function name.
Signed-off-by: NIan Campbell <ian.campbell@citrix.com>
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Tested-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: NJoel Becker <joel.becker@oracle.com>
Tested-by: NStanislaw Gruszka <sgruszka@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Miller <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1269830604-26214-4-git-send-email-yinghai@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

eed63519

SLOW_WORK: CONFIG_SLOW_WORK_PROC should be CONFIG_SLOW_WORK_DEBUG · a53f4f9e

由 David Howells 提交于 3月 29, 2010

CONFIG_SLOW_WORK_PROC was changed to CONFIG_SLOW_WORK_DEBUG, but not in all
instances. Change the remaining instances. This makes the debugfs file
display the time mark and the owner's description again.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a53f4f9e

slow-work: use get_ref wrapper instead of directly calling get_ref · 88be12c4

由 Dave Airlie 提交于 3月 29, 2010

Otherwise we can get an oops if the user has no get_ref/put_ref
requirement.
Signed-off-by: NDave Airlie <airlied@redhat.com>
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

88be12c4

27 3月, 2010 2 次提交

Freezer: Fix buggy resume test for tasks frozen with cgroup freezer · 5a7aadfe

由 Matt Helsley 提交于 3月 26, 2010

When the cgroup freezer is used to freeze tasks we do not want to thaw
those tasks during resume. Currently we test the cgroup freezer
state of the resuming tasks to see if the cgroup is FROZEN. If so
then we don't thaw the task. However, the FREEZING state also indicates
that the task should remain frozen.

This also avoids a problem pointed out by Oren Ladaan: the freezer state
transition from FREEZING to FROZEN is updated lazily when userspace reads
or writes the freezer.state file in the cgroup filesystem. This means that
resume will thaw tasks in cgroups which should be in the FROZEN state if
there is no read/write of the freezer.state file to trigger this
transition before suspend.

NOTE: Another "simple" solution would be to always update the cgroup
freezer state during resume. However it's a bad choice for several reasons:
Updating the cgroup freezer state is somewhat expensive because it requires
walking all the tasks in the cgroup and checking if they are each frozen.
Worse, this could easily make resume run in N^2 time where N is the number
of tasks in the cgroup. Finally, updating the freezer state from this code
path requires trickier locking because of the way locks must be ordered.

Instead of updating the freezer state we rely on the fact that lazy
updates only manage the transition from FREEZING to FROZEN. We know that
a cgroup with the FREEZING state may actually be FROZEN so test for that
state too. This makes sense in the resume path even for partially-frozen
cgroups -- those that really are FREEZING but not FROZEN.
Reported-by: NOren Ladaan <orenl@cs.columbia.edu>
Signed-off-by: NMatt Helsley <matthltc@us.ibm.com>
Cc: stable@kernel.org
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

5a7aadfe

Freezer: Only show the state of tasks refusing to freeze · 4f598458

由 Xiaotian Feng 提交于 3月 10, 2010

show_state will dump all tasks state, so if freezer failed to freeze
any task, kernel will dump all tasks state and flood the dmesg log.
This patch makes freezer only show state of tasks refusing to freeze.
Signed-off-by: NXiaotian Feng <dfeng@redhat.com>
Acked-by: NPavel Machek <pavel@ucw.cz>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

4f598458

25 3月, 2010 3 次提交

cpuset: alloc nodemask_t on the heap rather than the stack · 53feb297

由 Miao Xie 提交于 3月 23, 2010

Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

53feb297

cpuset: fix the problem that cpuset_mem_spread_node() returns an offline node · 5ab116c9

由 Miao Xie 提交于 3月 23, 2010

cpuset_mem_spread_node() returns an offline node, and causes an oops.

This patch fixes it by initializing task->mems_allowed to
node_states[N_HIGH_MEMORY], and updating task->mems_allowed when doing
memory hotplug.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Reported-by: NNick Piggin <npiggin@suse.de>
Tested-by: NNick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ab116c9

cgroups: remove duplicate include · 9d34706f

由 Li Zefan 提交于 3月 23, 2010

commit e6a1105b ("cgroups: subsystem module loading interface") and commit
c50cc752 ("sched, cgroups: Fix module export") result in duplicate
including of module.h
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Acked-by: NPaul Menage <menage@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9d34706f

24 3月, 2010 3 次提交

genirq: Move two IRQ functions from .init.text to .text · 860652bf

由 Henrik Kretzschmar 提交于 3月 24, 2010

Both functions should not be marked as __init, since they be called
from modules after the init section is freed.
Signed-off-by: NHenrik Kretzschmar <henne@nachtwindheim.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jkosina@suse.cz>
LKML-Reference: <1269431961-5731-1-git-send-email-henne@nachtwindheim.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

860652bf

genirq: Protect access to irq_desc->action in can_request_irq() · cc8c3b78

由 Thomas Gleixner 提交于 3月 23, 2010

can_request_irq() accesses and dereferences irq_desc->action w/o
holding irq_desc->lock. So action can be freed on another CPU before
it's dereferenced. Unlikely, but ...

Protect it with desc->lock.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

cc8c3b78

resources: add interfaces that return conflict information · 66f1207b

由 Bjorn Helgaas 提交于 3月 11, 2010

request_resource() and insert_resource() only return success or failure,
which no information about what existing resource conflicted with the
proposed new reservation.  This patch adds request_resource_conflict()
and insert_resource_conflict(), which return the conflicting resource.

Callers may use this for better error messages or to adjust the new
resource and retry the request.
Signed-off-by: NBjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>

66f1207b

23 3月, 2010 1 次提交

time: Fix accumulation bug triggered by long delay. · 830ec045

由 John Stultz 提交于 3月 18, 2010

The logarithmic accumulation done in the timekeeping has some overflow
protection that limits the max shift value. That means it will take
more then shift loops to accumulate all of the cycles. This causes
the shift decrement to underflow, which causes the loop to never exit.

The simplest fix would be simply to do a:
	if (shift)
		shift--;

However that is not optimal, as we know the cycle offset is larger
then the interval << shift, the above would make shift drop to zero,
then we would be spinning for quite awhile accumulating at interval
chunks at a time.

Instead, this patch only decreases shift if the offset is smaller
then cycle_interval << shift.  This makes sure we accumulate using
the largest chunks possible without overflowing tick_length, and limits
the number of iterations through the loop.

This issue was found and reported by Sonic Zhang, who also tested the fix.
Many thanks your explanation and testing!
Reported-by: NSonic Zhang <sonic.adi@gmail.com>
Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
Tested-by: NSonic Zhang <sonic.adi@gmail.com>
LKML-Reference: <1268948850-5225-1-git-send-email-johnstul@us.ibm.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

830ec045

22 3月, 2010 1 次提交

softlockup: Stop spurious softlockup messages due to overflow · 8c2eb480

由 Colin Ian King 提交于 3月 19, 2010

Ensure additions on touch_ts do not overflow.  This can occur
when the top 32 bits of the TSC reach 0xffffffff causing
additions to touch_ts to overflow and this in turn generates
spurious softlockup warnings.
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <1268994482.1798.6.camel@lenovo>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8c2eb480

19 3月, 2010 1 次提交

ring-buffer: Do 8 byte alignment for 64 bit that can not handle 4 byte align · 2271048d

由 Steven Rostedt 提交于 3月 18, 2010

The ring buffer uses 4 byte alignment while recording events into the
buffer, even on 64bit machines. This saves space when there are lots
of events being recorded at 4 byte boundaries.

The ring buffer has a zero copy method to write into the buffer, with
the reserving of space and then committing it. This may cause problems
when writing an 8 byte word into a 4 byte alignment (not 8). For x86 and
PPC this is not an issue, but on some architectures this would cause an
out-of-alignment exception.

This patch uses CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to determine
if it is OK to use 4 byte alignments on 64 bit machines. If it is not,
it forces the ring buffer event header to be 8 bytes and not 4,
and will align the length of the data to be 8 byte aligned.
This keeps the data payload at 8 byte alignments and will allow these
machines to run without issue.

The trick to this is that the header can be either 4 bytes or 8 bytes
depending on the length of the data payload. The 4 byte header
has a length field that supports up to 112 bytes. If the length of
the data is more than 112, the length field is set to zero, and the actual
length is stored in the next 4 bytes after the header.

When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, the code forces
zero in the 4 byte header forcing the length to be stored in the 4 byte
array, even with a small data load. It also forces the length of the
data load to be 8 byte aligned. The combination of these two guarantee
that the data is always at 8 byte alignment.
Tested-by: NFrederic Weisbecker <fweisbec@gmail.com>
           (on sparc64)
Reported-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

2271048d

17 3月, 2010 2 次提交

perf: Fix unexported generic perf_arch_fetch_caller_regs · dcd5c166

由 Frederic Weisbecker 提交于 3月 16, 2010

perf_arch_fetch_caller_regs() is exported for the overriden x86
version, but not for the generic weak version.

As a general rule, weak functions should not have their symbol
exported in the same file they are defined.

So let's export it on trace_event_perf.c as it is used by trace
events only.

This fixes:

	ERROR: ".perf_arch_fetch_caller_regs" [fs/xfs/xfs.ko] undefined!
	ERROR: ".perf_arch_fetch_caller_regs" [arch/powerpc/platforms/cell/spufs/spufs.ko] undefined!

-v2: And also only build it if trace events are enabled.
-v3: Fix changelog mistake
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1268697902-9518-1-git-send-regression-fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

dcd5c166

sched: Use proper type in sched_getaffinity() · 8bc037fb

由 KOSAKI Motohiro 提交于 3月 17, 2010

Using the proper type fixes the following compiler warning:

  kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: torvalds@linux-foundation.org
Cc: travis@sgi.com
Cc: peterz@infradead.org
Cc: drepper@redhat.com
Cc: rja@sgi.com
Cc: sharyath@in.ibm.com
Cc: steiner@sgi.com
LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8bc037fb

16 3月, 2010 2 次提交

kernel/sched.c: Suppress unused var warning · c890692b

由 Andrew Morton 提交于 3月 11, 2010

On UP:

  kernel/sched.c: In function 'wake_up_new_task':
  kernel/sched.c:2631: warning: unused variable 'cpu'
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

c890692b

rcu: Make rcu_read_lock_bh_held() allow for disabled BH · e3818b8d

由 Paul E. McKenney 提交于 3月 15, 2010

Disabling BH can stand in for rcu_read_lock_bh(), and this patch
updates rcu_read_lock_bh_held() to allow for this.  In order to
avoid include-file hell, this function is moved out of line to
kernel/rcupdate.c.

This fixes a false positive RCU warning.
Reported-by: NArnd Bergmann <arnd@arndb.de>
Reported-by: NEric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <20100316000343.GA25857@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

e3818b8d

15 3月, 2010 1 次提交

sched: sched_getaffinity(): Allow less than NR_CPUS length · cd3d8031

由 KOSAKI Motohiro 提交于 3月 12, 2010

[ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]

Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
Unfortunately, glibc sched interface has the following definition:

	# define __CPU_SETSIZE  1024
	# define __NCPUBITS     (8 * sizeof (__cpu_mask))
	typedef unsigned long int __cpu_mask;
	typedef struct
	{
	  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
	} cpu_set_t;

It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an
ABI issue ...

More recently, Sharyathi Nagesh reported following test program makes
misterious syscall failure:

 -----------------------------------------------------------------------
 #define _GNU_SOURCE
 #include<stdio.h>
 #include<errno.h>
 #include<sched.h>

 int main()
 {
     cpu_set_t set;
     if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
         printf("\n Call is failing with:%d", errno);
 }
 -----------------------------------------------------------------------

Because the kernel assumes len argument of sched_getaffinity() is bigger
than NR_CPUS. But now it is not correct.

Now we are faced with the following annoying dilemma, due to
the limitations of the glibc interface built in years ago:

 (1) if we change glibc's __CPU_SETSIZE definition, we lost
     binary compatibility of _all_ application.

 (2) if we don't change it, we also lost binary compatibility of
     Sharyathi's use case.

Then, I would propse to change the rule of the len argument of
sched_getaffinity().

Old:
	len should be bigger than NR_CPUS
New:
	len should be bigger than maximum possible cpu id

This creates the following behavior:

 (A) In the real 4096 cpus machine, the above test program still
     return -EINVAL.

 (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost
     all machines in the world), the above can run successfully.

Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means
they can rebuild their programs.

IOW we hope they are not annoyed by this issue ...
Reported-by: NSharyathi Nagesh <sharyath@in.ibm.com>
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: NUlrich Drepper <drepper@redhat.com>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cd3d8031

13 3月, 2010 14 次提交

tracing: Do not record user stack trace from NMI context · b6345879

由 Steven Rostedt 提交于 3月 12, 2010

A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

b6345879

tracing: Disable buffer switching when starting or stopping trace · a2f80714

由 Steven Rostedt 提交于 3月 12, 2010

When the trace iterator is read, tracing_start() and tracing_stop()
is called to stop tracing while the iterator is processing the trace
output.

These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:

  buffer = global_trace.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

      <<<--------- swap happens here

  buffer = max_tr.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

What happens is that we disabled the same buffer twice. On tracing_start()
we can enable the same buffer twice. All ring_buffer_record_disable()
must be matched with a ring_buffer_record_enable() or the buffer
can be disable permanently, or enable prematurely, and cause a bug
where a reset happens while a trace is commiting.

This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

a2f80714

tracing: Use same local variable when resetting the ring buffer · 283740c6

由 Steven Rostedt 提交于 3月 12, 2010

In the ftrace code that resets the ring buffer it references the
buffer with a local variable, but then uses the tr->buffer as the
parameter to reset. If the wakeup tracer is running, which can
switch the tr->buffer with the max saved buffer, this can break
the requirement of disabling the buffer before the reset.

   buffer = tr->buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(tr->buffer, cpu);

If the tr->buffer is swapped, then the reset is not happening to the
buffer that was disabled. This will cause the ring buffer to fail.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

283740c6

function-graph: Init curr_ret_stack with ret_stack · ea14eb71

由 Steven Rostedt 提交于 3月 12, 2010

If the graph tracer is active, and a task is forked but the allocating of
the processes graph stack fails, it can cause crash later on.

This is due to the temporary stack being NULL, but the curr_ret_stack
variable is copied from the parent. If it is not -1, then in
ftrace_graph_probe_sched_switch() the following:

	for (index = next->curr_ret_stack; index >= 0; index--)
		next->ret_stack[index].calltime += timestamp;

Will cause a kernel OOPS.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

ea14eb71

ring-buffer: Move disabled check into preempt disable section · 52fbe9cd

由 Lai Jiangshan 提交于 3月 08, 2010

The ring buffer resizing and resetting relies on a schedule RCU
action. The buffers are disabled, a synchronize_sched() is called
and then the resize or reset takes place.

But this only works if the disabling of the buffers are within the
preempt disabled section, otherwise a window exists that the buffers
can be written to while a reset or resize takes place.

Cc: stable@kernel.org
Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B949E43.2010906@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

52fbe9cd

sysctl extern cleanup: lockdep · 2edf5e49

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move lockdep extern declarations to linux/lockdep.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2edf5e49

sysctl extern cleanup: rtmutex · 4f0e056f

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move max_lock_depth extern declaration to linux/rtmutex.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4f0e056f

sysctl extern cleanup: acct · c55b7c3e

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move acct_parm extern declaration to linux/acct.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c55b7c3e

sysctl extern cleanup: sg · 15485a46

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move sg_big_buff extern declaration to scsi/sg.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Acked-by: NDoug Gilbert <dgilbert@interlog.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

15485a46

sysctl extern cleanup: module · 5ed10910

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move modprobe_path extern declaration to linux/kmod.h
Move modules_disabled extern declaration to linux/module.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ed10910

sysctl extern cleanup: rcu · e5ab6772

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move rcutorture_runnable extern declaration to linux/rcupdate.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Acked-by: NJosh Triplett <josh@joshtriplett.org>
Reviewed-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e5ab6772

sysctl extern cleanup: signal · d33ed52d

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move print_fatal_signals extern declaration to linux/signal.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d33ed52d

sysctl extern cleanup: C_A_D · eb5572fe

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move C_A_D extern variable declaration to linux/reboot.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eb5572fe

nsproxy: remove INIT_NSPROXY() · 8467005d

由 Alexey Dobriyan 提交于 3月 10, 2010

Remove INIT_NSPROXY(), use C99 initializer.
Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it.

Note: headers trim will be done later, now it's quite pointless because
results will be invalidated by merge window.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8467005d

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功