提交 · 107f8bdac992356b3a80d41c9f6ff4399159aa81 · openeuler / raspberrypi-kernel

29 3月, 2012 1 次提交

padata: Add a reference to the api documentation · 107f8bda

由 Steffen Klassert 提交于 3月 28, 2012

Add a reference to the padata api documentation at Documentation/padata.txt
Suggested-by: NPeter Zijlstra <peterz@infradead.org>
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

107f8bda

22 3月, 2012 5 次提交

memcg: let css_get_next() rely upon rcu_read_lock() · ca464d69

由 Hugh Dickins 提交于 3月 21, 2012

Remove lock and unlock around css_get_next()'s call to idr_get_next().
memcg iterators (only users of css_get_next) already did rcu_read_lock(),
and its comment demands that; but add a WARN_ON_ONCE to make sure of it.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ca464d69

cgroup: revert ss_id_lock to spinlock · 42aee6c4

由 Hugh Dickins 提交于 3月 21, 2012

Commit c1e2ee2d ("memcg: replace ss->id_lock with a rwlock") has now
been seen to cause the unfair behavior we should have expected from
converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
get_new_cssid() is waiting for the wlock, while there are 19 tasks using
the rlock in css_get_next() to get on with their memcg workload (in an
artificial test, admittedly).  Yet lib/idr.c was made suitable for RCU
way back: revert that commit, restoring ss->id_lock to a spinlock.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

42aee6c4

mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() · 05af2e10

由 David Rientjes 提交于 3月 21, 2012

sync_mm_rss() can only be used for current to avoid race conditions in
iterating and clearing its per-task counters. Remove the task argument
for it and its helper function, __sync_task_rss_stat(), to avoid thinking
it can be used safely for anything other than current.
Signed-off-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

05af2e10

cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87

由 Mel Gorman 提交于 3月 21, 2012

Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected).  The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cc9a6c87

mm: add rss counters consistency check · c3f0327f

由 Konstantin Khlebnikov 提交于 3月 21, 2012

Warn about non-zero rss counters at final mmdrop.

This check will prevent reoccurences of bugs such as that fixed in "mm:
fix rss count leakage during migration".

I didn't hide this check under CONFIG_VM_DEBUG because it rather small and
rss counters cover whole page-table management, so this is a good
invariant.
Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c3f0327f

21 3月, 2012 6 次提交

A
constify path argument of trace_seq_path() · 38eff289
由 Al Viro 提交于 3月 14, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
38eff289
A
constify path argument of audit_log_d_path() · 66b3fad3
由 Al Viro 提交于 3月 14, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
66b3fad3
A
switch open-coded instances of d_make_root() to new helper · 48fde701
由 Al Viro 提交于 1月 08, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
48fde701

exit_signal: fix the "parent has changed security domain" logic · b6e238dc

由 Oleg Nesterov 提交于 3月 19, 2012

exit_notify() changes ->exit_signal if the parent already did exec.
This doesn't really work, we are not going to send the signal now
if there is another live thread or the exiting task is traced. The
parent can exec before the last dies or the tracer detaches.

Move this check into do_notify_parent() which actually sends the
signal.

The user-visible change is that we do not change ->exit_signal,
and thus the exiting task is still "clone children" for
do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the
current logic is racy anyway.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b6e238dc

exit_signal: simplify the "we have changed execution domain" logic · e6368253

由 Oleg Nesterov 提交于 3月 19, 2012

exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id"
to handle the "we have changed execution domain" case.

We can change do_thread() to always set ->exit_signal = SIGCHLD
and remove this check to simplify the code.

We could change setup_new_exec() instead, this looks more logical
because it increments ->self_exec_id. But note that de_thread()
already resets ->exit_signal if it changes the leader, let's keep
both changes close to each other.

Note that we change ->exit_signal lockless, this changes the rules.
Thereafter ->exit_signal is not stable under tasklist but this is
fine, the only possible change is OLDSIG -> SIGCHLD. This can race
with eligible_child() but the race is harmless. We can race with
reparent_leader() which changes our ->exit_signal in parallel, but
it does the same change to SIGCHLD.

The noticeable user-visible change is that the execing task is not
"visible" to do_wait()->eligible_child(__WCLONE) right after exec.
To me this looks more logical, and this is consistent with mt case.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e6368253

CLONE_PARENT shouldn't allow to set ->exit_signal · 5f8aadd8

由 Oleg Nesterov 提交于 3月 14, 2012

The child must not control its ->exit_signal, it is the parent who
decides which signal the child should use for notification.

This means that CLONE_PARENT should not use "clone_flags & CSIGNAL",
the forking task is the sibling of the new process and their parent
doesn't control exit_signal in this case.

This patch uses ->exit_signal of the forking process, but perhaps
we should simply use SIGCHLD.

We read group_leader->exit_signal lockless, this can race with the
ORIGINAL_SIGNAL -> SIGCHLD transition, but this is fine.

Potentially this change allows to kill self_exec_id/parent_exec_id.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5f8aadd8

20 3月, 2012 2 次提交
- C
  power: remove the second argument of k[un]map_atomic() · 0de9a1e2
  由 Cong Wang 提交于 11月 25, 2011
```
Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: NCong Wang <amwang@redhat.com>
```
  0de9a1e2
- C
  kdb: remove the second argument of k[un]map_atomic() · d762a50b
  由 Cong Wang 提交于 11月 25, 2011
```
Signed-off-by: NCong Wang <amwang@redhat.com>
```
  d762a50b
16 3月, 2012 3 次提交

genirq: Remove paranoid warnons and bogus fixups · e04268b0

由 Thomas Gleixner 提交于 3月 15, 2012

Alexander pointed out that the warnons in the regular exit path are
bogus and the thread_mask one actually could be triggered when
__setup_irq() hands out that thread_mask again after __free_irq()
dropped irq_desc->lock.

Thinking more about it, neither IRQTF_RUNTHREAD nor the bit in
thread_mask can be set as this is the regular exit path. We come here
due to:
	__free_irq()
	   remove action from desc
	   synchronize_irq()
	   kthread_stop()

So synchronize_irq() makes sure that the thread finished running and
cleaned up both the thread_active count and thread_mask. After that
point nothing can set IRQTF_RUNTHREAD on this action. So the warnons
and the cleanups are pointless.
Reported-by: NAlexander Gordeev <agordeev@redhat.com>
Cc: Ido Yariv <ido@wizery.com>
Link: http://lkml.kernel.org/r/20120315190755.GA6732@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

e04268b0

prctl: use CAP_SYS_RESOURCE for PR_SET_MM option · 79f0713d

由 Cyrill Gorcunov 提交于 3月 15, 2012

CAP_SYS_ADMIN is already overloaded left and right, so to have more
fine-grained access control use CAP_SYS_RESOURCE here.

The CAP_SYS_RESOUCE is chosen because this prctl option allows a current
process to adjust some fields of memory map descriptor which rather
represents what the process owns: pointers to code, data, stack
segments, command line, auxiliary vector data and etc.
Suggested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Acked-by: NKees Cook <keescook@chromium.org>
Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

79f0713d

ntp: Fix integer overflow when setting time · a078c6d0

由 Sasha Levin 提交于 3月 15, 2012

'long secs' is passed as divisor to div_s64, which accepts a 32bit
divisor. On 64bit machines that value is trimmed back from 8 bytes
back to 4, causing a divide by zero when the number is bigger than
(1 << 32) - 1 and all 32 lower bits are 0.

Use div64_long() instead.
Signed-off-by: NSasha Levin <levinsasha928@gmail.com>
Cc: johnstul@us.ibm.com
Link: http://lkml.kernel.org/r/1331829374-31543-2-git-send-email-levinsasha928@gmail.com
Cc: stable@vger.kernel.org
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

a078c6d0

15 3月, 2012 1 次提交

printk: Make it compile with !CONFIG_PRINTK · 600e1458

由 Peter Zijlstra 提交于 3月 15, 2012

Commit 3ccf3e83 ("printk/sched: Introduce special
printk_sched() for those awkward moments") overlooked
an #ifdef, so move code around to respect these directives.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/1331811337.18960.179.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>

600e1458

14 3月, 2012 3 次提交

genirq: Flush the irq thread on synchronization · 7140ea19

由 Ido Yariv 提交于 12月 02, 2011

The current implementation does not always flush the threaded handler
when disabling the irq. In case the irq handler was called, but the
threaded handler hasn't started running yet, the interrupt will be
flagged as pending, and the handler will not run. This implementation
has some issues:

First, if the interrupt is a wake source and flagged as pending, the
system will not be able to suspend.

Second, when quickly disabling and re-enabling the irq, the threaded
handler might continue to run after the irq is re-enabled without the
irq handler being called first. This might be an unexpected behavior.

In addition, it might be counter-intuitive that the threaded handler
will not be called even though the irq handler was called and returned
IRQ_WAKE_THREAD.

Fix this by always waiting for the threaded handler to complete in
synchronize_irq().

[ tglx: Massaged comments, added WARN_ONs and the missing
  	IRQTF_RUNTHREAD check in exit_irq_thread() ]
Signed-off-by: NIdo Yariv <ido@wizery.com>
Link: http://lkml.kernel.org/r/1322843052-7166-1-git-send-email-ido@wizery.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

7140ea19

padata: Fix race on sequence number wrap · 2dc9b5db

由 Steffen Klassert 提交于 3月 09, 2012

When padata_do_parallel() is called from multiple cpus for the same
padata instance, we can get object reordering on sequence number wrap
because testing for sequence number wrap and reseting the sequence
number must happen atomically but is implemented with two atomic
operations. This patch fixes this by converting the sequence number
from atomic_t to an unsigned int and protect the access with a
spin_lock. As a side effect, we get rid of the sequence number wrap
handling because the seqence number wraps back to null now without
the need to do anything.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

2dc9b5db

padata: Fix race in the serialization path · 3047817b

由 Steffen Klassert 提交于 3月 09, 2012

When a padata object is queued to the serialization queue, another
cpu might process and free the padata object. So don't dereference
it after queueing to the serialization queue.
Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>

3047817b

13 3月, 2012 7 次提交

sched: Fix nohz load accounting -- again! · c308b56b

由 Peter Zijlstra 提交于 3月 01, 2012

Various people reported nohz load tracking still being wrecked, but Doug
spotted the actual problem. We fold the nohz remainder in too soon,
causing us to loose samples and under-account.

So instead of playing catch-up up-front, always do a single load-fold
with whatever state we encounter and only then fold the nohz remainder
and play catch-up.
Reported-by: NDoug Smythies <dsmythies@telus.net>
Reported-by: NLesÅ=82aw Kope=C4=87 <leslaw.kopec@nasza-klasa.pl>
Reported-by: NAman Gupta <aman@tmm1.net>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

c308b56b

sched: Update yield() docs · 8e3fabfd

由 Peter Zijlstra 提交于 3月 06, 2012

Suggested-by: NJoe Perches <joe@perches.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1331056466.11248.327.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>

8e3fabfd

printk/sched: Introduce special printk_sched() for those awkward moments · 3ccf3e83

由 Peter Zijlstra 提交于 2月 27, 2012

There's a few awkward printk()s inside of scheduler guts that people
prefer to keep but really are rather deadlock prone. Fudge around it
by storing the text in a per-cpu buffer and poll it using the existing
printk_tick() handler.

This will drop output when its more frequent than once a tick, however
only the affinity thing could possible go that fast and for that just
one should suffice to notify the admin he's done something silly..
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

3ccf3e83

sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer · 554cecaf

由 Diwakar Tundlam 提交于 3月 07, 2012

The 'next_balance' field of 'nohz' idle balancer must be initialized
to jiffies. Since jiffies is initialized to negative 300 seconds the
'nohz' idle balancer does not run for the first 300s (5mins) after
bootup. If no new processes are spawed or no idle cycles happen, the
load on the cpus will remain unbalanced for that duration.
Signed-off-by: NDiwakar Tundlam <dtundlam@nvidia.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1DD7BFEDD3147247B1355BEFEFE4665237994F30EF@HQMAIL04.nvidia.comSigned-off-by: NIngo Molnar <mingo@elte.hu>

554cecaf

sched: Cleanup cpu_active madness · 5fbd036b

由 Peter Zijlstra 提交于 12月 15, 2011

Stepan found:

CPU0		CPUn

_cpu_up()
  __cpu_up()

		boostrap()
		  notify_cpu_starting()
		  set_cpu_online()
		  while (!cpu_active())
		    cpu_relax()

<PREEMPT-out>

smp_call_function(.wait=1)
  /* we find cpu_online() is true */
  arch_send_call_function_ipi_mask()

  /* wait-forever-more */

<PREEMPT-in>
		  local_irq_enable()

  cpu_notify(CPU_ONLINE)
    sched_cpu_active()
      set_cpu_active()

Now the purpose of cpu_active is mostly with bringing down a cpu, where
we mark it !active to avoid the load-balancer from moving tasks to it
while we tear down the cpu. This is required because we only update the
sched_domain tree after we brought the cpu-down. And this is needed so
that some tasks can still run while we bring it down, we just don't want
new tasks to appear.

On cpu-up however the sched_domain tree doesn't yet include the new cpu,
so its invisible to the load-balancer, regardless of the active state.
So instead of setting the active state after we boot the new cpu (and
consequently having to wait for it before enabling interrupts) set the
cpu active before we set it online and avoid the whole mess.
Reported-by: NStepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>

5fbd036b

sched: Fix load-balance wreckage · 5d6523eb

由 Peter Zijlstra 提交于 3月 10, 2012

Commit 367456c7 ("sched: Ditch per cgroup task lists for
load-balancing") completely wrecked load-balancing due to
a few silly mistakes.

Correct those and remove more pointless code.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-zk04ihygwxn7qqrlpaf73b0r@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>

5d6523eb

workqueue: use percpu allocator for cwq on UP · e06ffa1e

由 Lai Jiangshan 提交于 3月 09, 2012

I notice that the commit bbddff makes percpu allocator can work on UP,
So we don't need the magic way for UP.
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

e06ffa1e

10 3月, 2012 4 次提交

genirq: Get rid of unnecessary IRQTF_DIED flag · 5234ffb9

由 Alexander Gordeev 提交于 3月 09, 2012

Currently IRQTF_DIED flag is set when a IRQ thread handler calls do_exit()
But also PF_EXITING per process flag gets set when a thread exits. This
fix eliminates the duplicate by using PF_EXITING flag.

Also, there is a race condition in exit_irq_thread(). In case a thread's
bit is cleared in desc->threads_oneshot (and the IRQ line gets unmasked),
but before IRQTF_DIED flag is set, a new interrupt might come in and set
just cleared bit again, this time forever. This fix throws IRQTF_DIED flag
away, eliminating the race as a result.

[ tglx: Test THREAD_EXITING first as suggested by Oleg ]
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135958.GD2114@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

5234ffb9

genirq: No need to check IRQTF_DIED before stopping a thread handler · 05d74efa

由 Alexander Gordeev 提交于 3月 09, 2012

Since 63706172 kthread_stop() is not
afraid of dead kernel threads. So no need to check if a thread is
alive before stopping it. These checks still were racy.
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135939.GC2114@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

05d74efa

genirq: Get rid of unnecessary irqaction field in task_struct · 4bcdf1d0

由 Alexander Gordeev 提交于 3月 09, 2012

When a new thread handler is created, an irqaction is passed to it as
data. Not only that irqaction is stored in task_struct by the handler
for later use, but also a structure associated with the kernel thread
keeps this value as long as the thread exists.

This fix kicks irqaction out off task_struct. Yes, I introduce new bit
field. But it allows not only to eliminate the duplicate, but also
shortens size of task_struct.
Reported-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120309135925.GB2114@dhcp-26-207.brq.redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

4bcdf1d0

genirq: Fix incorrect check for forced IRQ thread handler · 540b60e2

由 Alexander Gordeev 提交于 3月 09, 2012

We do not want a bitwise AND between boolean operands
Signed-off-by: NAlexander Gordeev <agordeev@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20120309135912.GA2114@dhcp-26-207.brq.redhat.com
Cc: stable@vger.kernel.org
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

540b60e2

08 3月, 2012 1 次提交

Revert "CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resume" · 4293f20c

由 Linus Torvalds 提交于 3月 07, 2012

This reverts commit 8f2f748b.

It causes some odd regression that we have not figured out, and it's too
late in the -rc series to try to figure it out now.

As reported by Konstantin Khlebnikov, it causes consistent hangs on his
laptop (Thinkpad x220: 2x cores + HT).  They can be avoided by adding
calls to "rebuild_sched_domains();" in cpuset_cpu_[in]active() for the
CPU_{ONLINE/DOWN_FAILED/DOWN_PREPARE}_FROZEN cases, but it's not at all
clear why, and it makes no sense.

Konstantin's config doesn't even have CONFIG_CPUSETS enabled, just to
make things even more interesting.  So it's not the cpusets, it's just
the scheduling domains.

So until this is understood, revert.
Bisected-reported-and-tested-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NIngo Molnar <mingo@elte.hu>
Acked-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4293f20c

07 3月, 2012 1 次提交

genirq: Clear action->thread_mask if IRQ_ONESHOT is not set · 52abb700

由 Thomas Gleixner 提交于 3月 06, 2012

Xommit ac563761(genirq: Unmask oneshot irqs when thread was not woken)
fails to unmask when a !IRQ_ONESHOT threaded handler is handled by
handle_level_irq.

This happens because thread_mask is or'ed unconditionally in
irq_wake_thread(), but for !IRQ_ONESHOT interrupts never cleared.  So
the check for !desc->thread_active fails and keeps the interrupt
disabled.

Keep the thread_mask zero for !IRQ_ONESHOT interrupts.

Document the thread_mask magic while at it.
Reported-and-tested-by: NSven Joachim <svenjoac@gmx.de>
Reported-and-tested-by: NStefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: stable@vger.kernel.org
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52abb700

06 3月, 2012 6 次提交

softirq: Reduce invoke_softirq() code duplication · b2a00178

由 Heiko Carstens 提交于 3月 05, 2012

The two invoke_softirq() variants are identical except for a single
line. So move the #ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED inside one of
the functions and get rid of the other one.
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

b2a00178

genirq: Fix long-term regression in genirq irq_set_irq_type() handling · a09b659c

由 Russell King 提交于 3月 05, 2012

In 2008, commit 0c5d1eb7 ("genirq: record trigger type") modified the
way set_irq_type() handles the 'no trigger' condition.  However, this has
an adverse effect on PCMCIA support on Intel StrongARM and probably PXA
platforms.

PCMCIA has several status signals on the socket which can trigger
interrupts; some of these status signals depend on the card's mode
(whether it is configured in memory or IO mode).  For example, cards have
a 'Ready/IRQ' signal: in memory mode, this provides an indication to
PCMCIA that the card has finished its power up initialization.  In IO
mode, it provides the device interrupt signal.  Other status signals
switch between on-board battery status and loud speaker output.

In classical PCMCIA implementations, where you have a specific socket
controller, the controller provides a method to mask interrupts from the
socket, and importantly ignore any state transitions on the pins which
correspond with interrupts once masked.  This masking prevents unwanted
events caused by the removal and application of socket power being
forwarded.

However, on platforms where there is no socket controller, the PCMCIA
status and interrupt signals are routed to standard edge-triggered GPIOs. 
These GPIOs can be configured to interrupt on rising edge, falling edge,
or never.  This is where the problems start.

Edge triggered interrupts are required to record events while disabled via
the usual methods of {free,request,disable,enable}_irq() to prevent
problems with dropped interrupts (eg, the 8390 driver uses disable_irq()
to defer the delivery of interrupts).  As a result, these interfaces can
not be used to implement the desired behaviour.

The side effect of this is that if the 'Ready/IRQ' GPIO is disabled via
disable_irq() on suspend, and enabled via enable_irq() after resume, we
will record the state transitions caused by powering events as valid
interrupts, and foward them to the card driver, which may attempt to
access a card which is not powered up.

This leads delays resume while drivers spin in their interrupt handlers,
and complaints from drivers before they realize what's happened.

Moreover, in the case of the 'Ready/IRQ' signal, this is requested and
freed by the card driver itself; the PCMCIA core has no idea whether the
interrupt is requested, and, therefore, whether a call to disable_irq()
would be valid.  (We tried this around 2.4.17 / 2.5.1 kernel era, and
ended up throwing it out because of this problem.)

Therefore, it was decided back in around 2002 to disable the edge
triggering instead, resulting in all state transitions on the GPIO being
ignored.  That's what we actually need the hardware to do.

The commit above changes this behaviour; it explicitly prevents the 'no
trigger' state being selected.

The reason that request_irq() does not accept the 'no trigger' state is
for compatibility with existing drivers which do not provide their desired
triggering configuration.  The set_irq_type() function is 'new' and not
used by non-trigger aware drivers.

Therefore, revert this change, and restore previously working platforms
back to their former state.
Signed-off-by: NRussell King <rmk+kernel@arm.linux.org.uk>
Cc: linux@arm.linux.org.uk
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@vger.kernel.org
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

a09b659c

hung_task: fix the broken rcu_lock_break() logic · 6027ce49

由 Oleg Nesterov 提交于 3月 05, 2012

check_hung_uninterruptible_tasks()->rcu_lock_break() introduced by
"softlockup: check all tasks in hung_task" commit ce9dbe24 looks
absolutely wrong.

	- rcu_lock_break() does put_task_struct(). If the task has exited
	  it is not safe to even read its ->state, nothing protects this
	  task_struct.

	- The TASK_DEAD checks are wrong too. Contrary to the comment, we
	  can't use it to check if the task was unhashed. It can be unhashed
	  without TASK_DEAD, or it can be valid with TASK_DEAD.

	  For example, an autoreaping task can do release_task(current)
	  long before it sets TASK_DEAD in do_exit().

	  Or, a zombie task can have ->state == TASK_DEAD but release_task()
	  was not called, and in this case we must not break the loop.

Change this code to check pid_alive() instead, and do this before we drop
the reference to the task_struct.

Note: while_each_thread() under rcu_read_lock() is not really safe, it can
livelock.  This will be fixed later, but fortunately in this case the
"max_count" logic saves us anyway.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NMandeep Singh Baines <msb@google.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6027ce49

vfork: kill PF_STARTING · 6e27f63e

由 Oleg Nesterov 提交于 3月 05, 2012

Previously it was (ab)used by utrace.  Then it was wrongly used by the
scheduler code.

Currently it is not used, kill it before it finds the new erroneous user.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6e27f63e

coredump_wait: don't call complete_vfork_done() · 57b59c4a

由 Oleg Nesterov 提交于 3月 05, 2012

Now that CLONE_VFORK is killable, coredump_wait() no longer needs
complete_vfork_done().  zap_threads() should find and kill all tasks with
the same ->mm, this includes our parent if ->vfork_done is set.

mm_release() becomes the only caller, unexport complete_vfork_done().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

57b59c4a

vfork: make it killable · d68b46fe

由 Oleg Nesterov 提交于 3月 05, 2012

Make vfork() killable.

Change do_fork(CLONE_VFORK) to do wait_for_completion_killable().  If it
fails we do not return to the user-mode and never touch the memory shared
with our child.

However, in this case we should clear child->vfork_done before return, we
use task_lock() in do_fork()->wait_for_vfork_done() and
complete_vfork_done() to serialize with each other.

Note: now that we use task_lock() we don't really need completion, we
could turn task->vfork_done into "task_struct *wake_up_me" but this needs
some complications.

NOTE: this and the next patches do not affect in-kernel users of
CLONE_VFORK, kernel threads run with all signals ignored including
SIGKILL/SIGSTOP.

However this is obviously the user-visible change.  Not only a fatal
signal can kill the vforking parent, a sub-thread can do execve or
exit_group() and kill the thread sleeping in vfork().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d68b46fe