Commits · 3d214faea6e4f9b6018bf8589f4b245126349c0a · openeuler / raspberrypi-kernel

30 Oct, 2011 1 commit

[S390] kdump: Add KEXEC_CRASH_CONTROL_MEMORY_LIMIT · 3d214fae

Committed by Michael Holzheu on Oct 30, 2011

On s390 there is a different KEXEC_CONTROL_MEMORY_LIMIT for the normal and
the kdump kexec case. Therefore this patch introduces a new macro
KEXEC_CRASH_CONTROL_MEMORY_LIMIT. This is set to
KEXEC_CONTROL_MEMORY_LIMIT for all architectures that do not define
KEXEC_CRASH_CONTROL_MEMORY_LIMIT.
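
The fallback described above can be sketched as a preprocessor default. This is a minimal illustration, not the patch itself: the numeric value and the helper `control_memory_limit()` are assumptions made up for the example.

```c
#include <assert.h>

/* Illustrative value only; real limits are defined per architecture. */
#define KEXEC_CONTROL_MEMORY_LIMIT (1UL << 31)

/*
 * An architecture such as s390 would define its own crash limit before
 * this point; every other architecture inherits the normal limit.
 */
#ifndef KEXEC_CRASH_CONTROL_MEMORY_LIMIT
#define KEXEC_CRASH_CONTROL_MEMORY_LIMIT KEXEC_CONTROL_MEMORY_LIMIT
#endif

/* Hypothetical helper: pick the limit that applies to the image type. */
unsigned long control_memory_limit(int is_crash_image)
{
	return is_crash_image ? KEXEC_CRASH_CONTROL_MEMORY_LIMIT
			      : KEXEC_CONTROL_MEMORY_LIMIT;
}
```

With no architecture override in place, both image types end up with the same limit.
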
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

26 Oct, 2011 2 commits

params: make dashes and underscores in parameter names truly equal · b1e4d20c

Committed by Michal Schmidt on Oct 10, 2011

The user may use "foo-bar" for a kernel parameter defined as "foo_bar".
Make sure it works the other way around too.

Apply the equality of dashes and underscores on early_params and __setup
params as well.

The example given in Documentation/kernel-parameters.txt indicates that
this is the intended behaviour.

With the patch the kernel accepts "log-buf-len=1M" as expected.
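
A userspace sketch of the comparison this message describes is shown below. The function names are chosen for the example and are not claimed to match the kernel's implementation; the point is only that both sides are normalized before comparing, so the equality holds in both directions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Map '-' to '_' so that both spellings compare equal. */
char dash2underscore(char c)
{
	return c == '-' ? '_' : c;
}

/* Compare two parameter names, treating '-' and '_' as the same character. */
bool param_names_equal(const char *a, const char *b)
{
	assert(a != NULL && b != NULL);
	for (size_t i = 0; ; i++) {
		if (dash2underscore(a[i]) != dash2underscore(b[i]))
			return false;
		if (a[i] == '\0')
			return true;	/* both strings ended together */
	}
}
```

Under this sketch, `param_names_equal("log-buf-len", "log_buf_len")` is true regardless of which spelling is the defined one.
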
https://bugzilla.redhat.com/show_bug.cgi?id=744545
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (neatened implementations)

kmod: prevent kmod_loop_msg overflow in __request_module() · 37252db6

Committed by Jiri Kosina on Oct 26, 2011

Due to the post-increment in the condition on kmod_loop_msg in
__request_module(), the system log can be spammed by many more than 5
instances of the 'runaway loop' message if the number of events triggering
it causes kmod_loop_msg to overflow.

Fix that by making sure we never increment it past the threshold.
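
The difference between the two shapes can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and an 8-bit unsigned counter is used so that wraparound is well defined and easy to observe, which is not the type the kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LOOP_MSGS 5	/* threshold named in the commit text */

/*
 * Problematic shape: the counter is bumped on every call, so it eventually
 * wraps around and the condition becomes true again. Shown for contrast.
 */
bool should_warn_buggy(unsigned char *counter)
{
	assert(counter != NULL);
	return (*counter)++ < MAX_LOOP_MSGS;
}

/* Fixed shape: increment only while still below the threshold. */
bool should_warn(unsigned char *counter)
{
	assert(counter != NULL);
	if (*counter < MAX_LOOP_MSGS) {
		(*counter)++;
		return true;
	}
	return false;
}
```

Once the fixed counter reaches the threshold it stays there, so no later call can re-enable the message.
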
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
CC: stable@kernel.org

24 Oct, 2011 1 commit

irq: Add EXPORT_SYMBOL_GPL to function of irq generic-chip · 825de2e9

Committed by Nobuhiro Iwamatsu on Oct 17, 2011

Some functions of irq generic-chip are undefined when built into modules,
because EXPORT_SYMBOL_GPL is not set for them.

ERROR: "irq_setup_generic_chip" [drivers/gpio/gpio-pch.ko] undefined!
ERROR: "irq_alloc_generic_chip" [drivers/gpio/gpio-pch.ko] undefined!
ERROR: "irq_setup_generic_chip" [drivers/gpio/gpio-ml-ioh.ko] undefined!
ERROR: "irq_alloc_generic_chip" [drivers/gpio/gpio-ml-ioh.ko] undefined!

Add EXPORT_SYMBOL_GPL to these functions so that modules can refer
to them.
Signed-off-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

18 Oct, 2011 1 commit

cputimer: Cure lock inversion · bcd5cff7

Committed by Peter Zijlstra on Oct 17, 2011

There's a lock inversion between the cputimer->lock and rq->lock;
notably the two callchains involved are:

 update_rlimit_cpu()
   sighand->siglock
   set_process_cpu_timer()
     cpu_timer_sample_group()
       thread_group_cputimer()
         cputimer->lock
         thread_group_cputime()
           task_sched_runtime()
             ->pi_lock
             rq->lock

 scheduler_tick()
   rq->lock
   task_tick_fair()
     update_curr()
       account_group_exec()
         cputimer->lock

Where the first one is enabling a CLOCK_PROCESS_CPUTIME_ID timer, and
the second one is keeping up-to-date.

This problem was introduced by e8abccb7 ("posix-cpu-timers: Cure
SMP accounting oddities").

Cure the problem by removing the cputimer->lock and rq->lock nesting,
this leaves concurrent enablers doing duplicate work, but the time
wasted should be on the same order otherwise wasted spinning on the
lock and the greater-than assignment filter should ensure we preserve
monotonicity.
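
The "greater-than assignment filter" mentioned above can be sketched as a store that only ever moves a value forward. The name `update_if_greater()` and the `uint64_t` type are assumptions for illustration, and the sketch is single-threaded; whatever synchronization the kernel applies around the store is out of scope here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the sample only if it moves the value forward. */
void update_if_greater(uint64_t *current, uint64_t sample)
{
	assert(current != NULL);
	if (sample > *current)
		*current = sample;
}
```

If two enablers compute samples concurrently and the older sample is stored last, the filter drops it, so the recorded time never goes backwards.
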
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Link: http://lkml.kernel.org/r/1318928713.21167.4.camel@twins
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

17 Oct, 2011 14 commits

Avoid using variable-length arrays in kernel/sys.c · a84a79e4

Committed by Linus Torvalds on Oct 17, 2011

The size is always valid, but variable-length arrays generate worse code
for no good reason (unless the function happens to be inlined and the
compiler sees the length for the simple constant it is).

Also, there seems to be some code generation problem on POWER, where
Henrik Bakken reports that register r28 can get corrupted under some
subtle circumstances (interrupt happening at the wrong time?).  That all
indicates some seriously broken compiler issues, but since variable
length arrays are bad regardless, there's little point in trying to
chase it down.

"Just don't do that, then".
Reported-by: Henrik Grindal Bakken <henribak@cisco.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlier · 9bab0b7f

Committed by Ian Campbell on Oct 03, 2011

This adds a mechanism to resume selected IRQs during syscore_resume
instead of dpm_resume_noirq.

Under Xen we need to resume IRQs associated with IPIs early enough
that the resched IPI is unmasked and we can therefore schedule
ourselves out of the stop_machine where the suspend/resume takes
place.

This issue was introduced by 676dc3cf "xen: Use IRQF_FORCE_RESUME".
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk
Cc: stable@kernel.org (at least to 2.6.32.y)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image · 081a9d04

Committed by Bojan Smojver on Oct 13, 2011

Use threads for LZO compression/decompression on hibernate/thaw.
Improve buffering on hibernate/thaw.
Calculate/verify CRC32 of the image pages on hibernate/thaw.

In my testing, this improved write/read speed by a factor of about two.
Signed-off-by: Bojan Smojver <bojan@rexursive.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Hibernate: Do not initialize static and extern variables to 0 · d231ff1a

Committed by Barry Song on Oct 11, 2011

Static and extern variables in kernel/power/hibernate.c need not be
initialized to 0 explicitly, so remove those initializations.
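
As a small illustration of why the explicit initializers are redundant: C guarantees that objects with static storage duration start out as zero. The variable and function names below are made up for the example and do not come from kernel/power/hibernate.c.

```c
#include <assert.h>

/* Both objects have static storage duration, so they start out as 0. */
static int example_delay;	/* hypothetical name, implicitly 0 */
int example_flag;		/* hypothetical name, implicitly 0 */

/* Returns the sum of the two defaults, which is 0 before any assignment. */
int sum_of_defaults(void)
{
	return example_delay + example_flag;
}
```
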

[rjw: Modified subject, added changelog.]
Signed-off-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too · 27920651

Committed by Jeff Layton on Oct 11, 2011

TASK_KILLABLE is often used to put tasks to sleep for quite some time.
One of the most common uses is to put tasks to sleep while waiting for
replies from a server on a networked filesystem (such as CIFS or NFS).

Unfortunately, fake_signal_wake_up does not currently wake up tasks
that are sleeping in TASK_KILLABLE state. This means that even if the
code were in place to allow them to freeze while in this sleep, it
wouldn't work anyway.

This patch changes this function to wake tasks in this state as well.
This should be harmless -- if the code doing the sleeping doesn't have
handling to deal with freezer events, it should just go back to sleep.
If it does, then this will allow that code to do the right thing.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Hibernate: Add resumedelay kernel param in addition to resumewait · f126f733

Committed by Barry Song on Oct 10, 2011

Patch "PM / Hibernate: Add resumewait param to support MMC-like
devices as resume file" added the resumewait kernel command line
option.  The present patch adds resumedelay so that
resumewait/delay were analogous to rootwait/delay.

[rjw: Modified the subject and changelog slightly.]
Signed-off-by: Barry Song <baohua.song@csr.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Hibernate: Add resumewait param to support MMC-like devices as resume file · 6f8d7022

Committed by Barry Song on Oct 06, 2011

Some devices such as MMC are detected asynchronously and very slowly. For
example, drivers/mmc/host/sdhci.c launches a 200ms delayed work to detect
MMC partitions and then add the disk.

We have wait_for_device_probe() and scsi_complete_async_scans()
before calling swsusp_check(), but it is not enough to wait for MMC.

This patch adds resumewait kernel param just like rootwait so
that we have enough time to wait until MMC is ready. The difference is
that we wait for resume partition whereas rootwait waits for rootfs
partition (which may be on a different device).

This patch will make hibernation support many embedded products
without SCSI devices, but with devices like MMC.

[rjw: Modified the changelog slightly.]
Signed-off-by: Barry Song <Baohua.Song@csr.com>
Reviewed-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Hibernate: Fix typo in a kerneldoc comment · 21e82808

Committed by Barry Song on Sep 27, 2011

Fix a typo in a function name in the kerneldoc comment next to
resume_target_kernel().

[rjw: Changed the subject slightly, added the changelog.]
Signed-off-by: Barry Song <Baohua.Song@csr.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Hibernate: Freeze kernel threads after preallocating memory · 2aede851

Committed by Rafael J. Wysocki on Sep 26, 2011

There is a problem with the current ordering of hibernate code which
leads to deadlocks in some filesystems' memory shrinkers.  Namely,
some filesystems use freezable kernel threads that are inactive when
the hibernate memory preallocation is carried out.  Those same
filesystems use memory shrinkers that may be triggered by the
hibernate memory preallocation.  If those memory shrinkers wait for
the frozen kernel threads, the hibernate process deadlocks (this
happens with XFS, for one example).

Apparently, it is not technically viable to redesign the filesystems
in question to avoid the situation described above, so the only
possible solution of this issue is to defer the freezing of kernel
threads until the hibernate memory preallocation is done, which is
implemented by this change.

Unfortunately, this requires the memory preallocation to be done
before the "prepare" stage of device freeze, so after this change the
only way drivers can allocate additional memory for their freeze
routines in a clean way is to use PM notifiers.
Reported-by: Christoph <cr2005@u-club.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / VT: Cleanup #if defined uglyness and fix compile error · 37cce26b

Committed by H Hartley Sweeten on Sep 21, 2011

Introduce the config option CONFIG_VT_CONSOLE_SLEEP in order to clean up
the #if defined ugliness for the vt suspend support functions. Note that
CONFIG_VT_CONSOLE is already dependent on CONFIG_VT.

The function pm_set_vt_switch is actually dependent on CONFIG_VT and not
CONFIG_PM_SLEEP. This fixes a compile error when CONFIG_PM_SLEEP is
not set:

drivers/tty/vt/vt_ioctl.c:1794: error: redefinition of 'pm_set_vt_switch'
include/linux/suspend.h:17: error: previous definition of 'pm_set_vt_switch' was here

Also, remove the incorrect path from the comment in console.c.

[rjw: Replaced #if defined() with #ifdef in suspend.h.]
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Suspend: Off by one in pm_suspend() · 528f7ce6

Committed by Dan Carpenter on Sep 21, 2011

In enter_state() we use "state" as an offset for the pm_states[]
array.  The pm_states[] array only has PM_SUSPEND_MAX elements so
this test is off by one.
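
The bounds rule behind this fix can be sketched as follows: an array with N elements has valid indices 0 to N - 1, so the maximum itself must be rejected. The state numbering, the name table and both function names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state numbering for illustration; only the bounds matter. */
enum { STATE_ON = 0, STATE_STANDBY = 1, STATE_MEM = 3, STATE_MAX = 4 };

static const char *const state_names[STATE_MAX] = {
	[STATE_STANDBY] = "standby",
	[STATE_MEM] = "mem",
};

/* Valid indices are 0 .. STATE_MAX - 1, so STATE_MAX itself is rejected. */
bool state_in_range(int state)
{
	return state > STATE_ON && state < STATE_MAX;
}

/* Look up a name only after the index has passed the range check. */
const char *state_name(int state)
{
	return state_in_range(state) ? state_names[state] : NULL;
}
```

A check written with `<=` instead of `<` would let `STATE_MAX` through and read one element past the end of the table.
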
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org

PM / Hibernate: Include storage keys in hibernation image on s390 · 85055dd8

Committed by Martin Schwidefsky on Aug 17, 2011

For s390 there is one additional byte associated with each page,
the storage key. This byte contains the referenced and changed
bits and needs to be included into the hibernation image.
If the storage keys are not restored to their previous state all
original pages would appear to be dirty. This can cause
inconsistencies e.g. with read-only filesystems.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM: Fix build issue in main.c for CONFIG_PM_SLEEP unset · ca123102

Committed by Rafael J. Wysocki on Aug 11, 2011

Suspend statistics should depend on CONFIG_PM_SLEEP, so make that
happen.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

PM / Suspend: Add statistics debugfs file for suspend to RAM · 2a77c46d

Committed by ShuoX Liu on Aug 10, 2011

Record the number of S3 failures for each failure reason, and the names of
the last two devices that failed during S3.
This can be checked through the 'suspend_stats' entry in debugfs.

The motivation of the patch:

We are enabling power features on Medfield. Comparing with PC/notebook,
a mobile enters/exits suspend-2-ram (we call it s3 on Medfield) far
more frequently. If it can't enter suspend-2-ram in time, the power
might be used up soon.

We often find that a device suspend fails. The system then retries
s3 over and over again. As the display is off, testers and developers
don't know what is happening.

Some testers and developers complain they don't know if the system
tries suspend-2-ram, and what device fails to suspend. They need
such info for a quick check. The patch adds suspend_stats under
debugfs for users to check suspend to RAM statistics quickly.

Without this patch, there are other methods to get info about
what device fails. One is to turn on CONFIG_PM_DEBUG, but users
would get too much info and testers would need to recompile the system.

In addition, dynamic debug is another good tool to dump debug info.
But it still doesn't match our usage scenario closely.
1) users need to write a user space parser to process the syslog output;
2) Our testing scenario is we leave the mobile for at least hours.
   Then, check its status. No serial console available during the
   testing. One is because console would be suspended, and the other
   is serial console connecting with spi or HSU devices would consume
   power. These devices are powered off at suspend-2-ram.
Signed-off-by: ShuoX Liu <shuox.liu@intel.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

14 Oct, 2011 2 commits

tracing: Fix returning of duplicate data after EOF in trace_pipe_raw · 436fc280

Committed by Steven Rostedt on Oct 14, 2011

The trace_pipe_raw handler holds a cached page from the time the file
is opened to the time it is closed. The cached page is used to handle
the case of the user space buffer being smaller than what was read from
the ring buffer. The left over buffer is held in the cache so that the
next read will continue where the data left off.

After EOF is returned (no more data in the buffer), the index of
the cached page is set to zero. If a user app reads the page again
after EOF, the check in the buffer will see that the cached page
is less than page size and will return the cached page again. This
will cause reading the trace_pipe_raw again after EOF to return
duplicate data, making the output look like the time went backwards
but instead data is just repeated.

The fix is to not reset the index right after all data is read
from the cache, but to reset it after all data is read and more
data exists in the ring buffer.

Cc: stable <stable@kernel.org>
Reported-by: Jeremy Eder <jeder@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ftrace: Fix README to state tracing_on to start/stop tracing · 9b5f8b31

Committed by Geunsik Lim on Aug 12, 2011

tracing_enabled option is deprecated.
To start/stop tracing, write to /sys/kernel/debug/tracing/tracing_on
without tracing_enabled. This patch is based on Linux 3.1.0-rc1
Signed-off-by: Geunsik Lim <geunsik.lim@samsung.com>
Link: http://lkml.kernel.org/r/1313127022-23830-1-git-send-email-leemgs1@gmail.com
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

11 Oct, 2011 3 commits

tracing: Do not allocate buffer for trace_marker · d696b58c

Committed by Steven Rostedt on Sep 22, 2011

When doing intense tracing, the kmalloc inside trace_marker can
introduce side effects to what is being traced.

As trace_marker() is used by userspace to inject data into the
kernel ring buffer, it needs to do so with the least amount
of intrusion to the operations of the kernel or the user space
application.

As the ring buffer is designed to write directly into the buffer
without the need to make a temporary buffer, and userspace already
went through the hassle of knowing how big the write will be,
we can simply pin the userspace pages and write the data directly
into the buffer. This reduces the impact of tracing via trace_marker
tremendously!

Thanks to Peter Zijlstra and Thomas Gleixner for pointing out the
use of get_user_pages_fast() and kmap_atomic().
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

tracing: Warn on output if the function tracer was found corrupted · e0a413f6

Committed by Steven Rostedt on Sep 29, 2011

As the function tracer is very intrusive, lots of self checks are
performed on the tracer and if something is found to be strange
it will shut itself down keeping it from corrupting the rest of the
kernel. This shutdown may still allow functions to be traced, as the
tracing only stops new modifications from happening. Trying to stop
the function tracer itself can cause more harm as it requires code
modification.

Although a WARN_ON() is executed, a user may not notice it. To help
the user see that something isn't right with the tracing of the system
a big warning is added to the output of the tracer that lets the user
know that their data may be incomplete.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ftrace/kprobes: Fix not to delete probes if in use · 02ca1521

Committed by Masami Hiramatsu on Oct 04, 2011

Fix kprobe-tracer not to delete a probe if the probe is in use.
In that case, delete operation will return -EBUSY.

This bug can cause a kernel panic if enabled probes are deleted
during perf record.

(Add some probes on functions)
sh-4.2# perf probe --del probe:\*
sh-4.2# exit
(kernel panic)

This is originally reported on the fedora bugzilla:

 https://bugzilla.redhat.com/show_bug.cgi?id=742383

I've also checked that this problem doesn't happen on
tracepoints when removing a module, because the perf event
locks the target module.

$ sudo ./perf record -e xfs:\* -aR sh
sh-4.2# rmmod xfs
ERROR: Module xfs is in use
sh-4.2# exit
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.203 MB perf.data (~8862 samples) ]
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/20111004104438.14591.6553.stgit@fedora15
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

06 Oct, 2011 6 commits

sched: Don't use tasklist_lock for debug prints · 510f5acc

Committed by Thomas Gleixner on Jul 17, 2011

Avoid taking locks from debug prints, this avoids latencies on -rt,
and improves reliability of the debug code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Warn on rt throttling · 1c83437e

Committed by Thomas Gleixner on Oct 05, 2011

The default rt-throttling is a source of never ending questions. Warn
once when we go into throttling so folks have that info in dmesg.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1110051331480.18778@ionos
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Unify the ->cpus_allowed mask copy · 4939602a

Committed by Peter Zijlstra on Jun 25, 2011

Currently every sched_class::set_cpus_allowed() implementation has to
copy the cpumask into task_struct::cpus_allowed, this is pointless,
put this copy in the generic code.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Wrap scheduler p->cpus_allowed access · fa17b507

Committed by Peter Zijlstra on Jun 16, 2011

This task is preparatory for the migrate_disable() implementation, but
stands on its own and provides a cleanup.

It currently only converts those sites required for task-placement.
Kosaki-san once mentioned replacing cpus_allowed with a proper
cpumask_t instead of the NR_CPUS sized array it currently is, that
would also require something like this.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Request for idle balance during nohz idle load balance · 6eb57e0d

Committed by Suresh Siddha on Oct 03, 2011

rq's idle_at_tick is set to idle/busy during the timer tick
depending on whether the cpu was idle or not. This will be used later in the load
balance that will be done in the softirq context (which is a process
context in -RT kernels).

For nohz kernels, for the cpu doing nohz idle load balance on behalf of
all the idle cpu's, its rq->idle_at_tick might have a stale value (which is
recorded when it got the timer tick presumably when it is busy).

As the nohz idle load balancing is also being done at the same place
as the regular load balancing, nohz idle load balancing was bailing out
when it sees rq's idle_at_tick not set.

Thus leading to poor system utilization.

Rename rq's idle_at_tick to idle_balance and set it when someone requests
for nohz idle balance on an idle cpu.
Reported-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Use resched IPI to kick off the nohz idle balance · ca38062e

Committed by Suresh Siddha on Oct 03, 2011

Current use of smp call function to kick the nohz idle balance can deadlock
in this scenario.

1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
data (csd) to the call single queue, cpu-A took a timer interrupt.  Actual IPI
to cpu-B to process the call single queue is not yet sent.

2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
and __smp_call_function_single() with nowait will queue the csd to the
cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
as the call single queue was not empty.

3. cpu-A is busy with a lot of interrupts

4. Meanwhile cpu-B is entering and exiting idle and noticed that it has
its rq->nohz_balance_kick set to '1'. So it will go ahead and do the
idle load balancer and clear its rq->nohz_balance_kick.

5. At this point, csd queued as part of the step-2 above is still locked
and waiting to be serviced on cpu-B.

6. cpu-A is still busy with interrupt load and now it got another timer
interrupt and as part of it decided to kick cpu-B for another idle load
balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
above) and does __smp_call_function_single() with the same csd that is
still locked.

7. And we get a deadlock waiting for the csd_lock() in the
__smp_call_function_single().

The main issue here is that cpu-B can service the idle load balancer kick
request from cpu-A even without receiving the IPI, and this leads to
doing multiple __smp_call_function_single() on the same csd, leading to
deadlock.

To kick a cpu, scheduler already has the reschedule vector reserved. Use
that mechanism (kick_process()) instead of using the generic smp call function
mechanism to kick off the nohz idle load balancing and avoid the deadlock.

   [ This issue is present from 2.6.35+ kernels, but marking it -stable
     only from v3.0+ as the proposed fix depends on the scheduler_ipi()
     that is introduced recently. ]
Reported-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org # v3.0+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

05 Oct, 2011 2 commits

rtmutex: Add missing rcu_read_unlock() in debug_rt_mutex_print_deadlock() · 68cc3990

Committed by Thomas Gleixner on Oct 05, 2011

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

genirq: Fix fatfinered fixup really · 32cffdde

Committed by Thomas Gleixner on Oct 04, 2011

Putting the argument inside the quote does not really help.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

04 Oct, 2011 5 commits

sched: Fix idle_cpu() · 908a3283

Committed by Thomas Gleixner on Sep 15, 2011

On -rt we observed hackbench waking all 400 tasks to a single cpu.
This is because of select_idle_sibling()'s interaction with the new
ipi based wakeup scheme.

The existing idle_cpu() test only checks to see if the current task on
that cpu is the idle task, it does not take already queued tasks into
account, nor does it take queued to be woken tasks into account.

If the remote wakeup IPIs come hard enough, there won't be time to
schedule away from the idle task, and would thus keep thinking the cpu
was in fact idle, regardless of the fact that there were already
several hundred tasks runnable.

We couldn't reproduce on mainline, but there's no reason it couldn't
happen.
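
The stricter idleness test described above can be modelled in userspace as shown below. The struct layout and every name in it are hypothetical simplifications of a per-CPU run queue; the sketch only shows that all three conditions named in the message are checked, not how the kernel stores them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-CPU run queue. */
struct task {
	int id;
};

struct run_queue {
	const struct task *curr;		/* task currently on the CPU */
	const struct task *idle;		/* that CPU's idle task */
	unsigned int nr_running;		/* tasks already queued to run */
	unsigned int nr_pending_wakeups;	/* remote wakeups not yet processed */
};

/* A CPU counts as idle only if nothing is running, queued or pending. */
bool cpu_is_idle(const struct run_queue *rq)
{
	assert(rq != NULL);
	if (rq->curr != rq->idle)
		return false;
	if (rq->nr_running != 0)
		return false;
	if (rq->nr_pending_wakeups != 0)
		return false;
	return true;
}
```

With only the first check in place, a CPU still running its idle task would be reported idle even with hundreds of wakeups pending, which is the behaviour the message reports on -rt.
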
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Convert to struct llist · fa14ff4a

Committed by Peter Zijlstra on Sep 12, 2011

Use the generic llist primitives.

We had a private lockless list implementation in the scheduler in the wake-list
code, now that we have a generic llist implementation that provides all required
operations, switch to it.

This patch is not expected to change any behavior.
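
For readers unfamiliar with lock-less lists, the kind of operations such a list offers can be sketched with C11 atomics. This is a userspace model with made-up names (`lpush`, `ltake_all`, `lnext`), not the kernel's llist code: producers push with a compare-and-swap loop, and a consumer detaches the whole list with one atomic exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names; a userspace model of a lock-less singly linked list. */
struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/* Push one node at the front; retries if another thread changed the head. */
void lpush(struct lhead *head, struct lnode *node)
{
	struct lnode *old = atomic_load(&head->first);

	assert(node != NULL);
	do {
		node->next = old;	/* 'old' is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak(&head->first, &old, node));
}

/* Detach the whole list in one atomic step and hand it to the caller. */
struct lnode *ltake_all(struct lhead *head)
{
	return atomic_exchange(&head->first, (struct lnode *)NULL);
}

/* Accessor so callers need not touch the 'next' member directly. */
struct lnode *lnext(const struct lnode *node)
{
	return node->next;
}
```

Nodes come back in reverse push order, so a consumer that needs arrival order has to reverse the detached list itself.
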
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1315836353.26517.42.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>

llist: Add llist_next() · 924f8f5a

Committed by Peter Zijlstra on Sep 12, 2011

So we don't have to expose the struct list_node member.

Cc: Huang Ying <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1315836348.26517.41.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>

irq_work: Use llist in the struct irq_work logic · 38aaf809

Committed by Huang Ying on Sep 08, 2011

Use llist in irq_work instead of the lock-less linked list
implementation in irq_work to avoid the code duplication.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1315461646-1379-6-git-send-email-ying.huang@intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

ipv4: NET_IPV4_ROUTE_GC_INTERVAL removal · 349d2895

Committed by Vasily Averin on Sep 30, 2011

Remove the obsolete sysctl; the ip_rt_gc_interval variable
has not been used since 2.6.38.
Signed-off-by: Vasily Averin <vvs@sw.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

03 Oct, 2011 2 commits

genirq: percpu: allow interrupt type to be set at enable time · 1e7c5fd2

Committed by Marc Zyngier on Sep 30, 2011

As request_percpu_irq() doesn't allow for a percpu interrupt to have
its type configured (it is generally impossible to configure it on all
CPUs at once), add a 'type' argument to enable_percpu_irq().

This allows some low-level, board specific init code to be switched to
a generic API.

[ tglx: Added WARN_ON argument ]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

genirq: Add support for per-cpu dev_id interrupts · 31d9d9b6

Committed by Marc Zyngier on Sep 23, 2011

The ARM GIC interrupt controller offers per CPU interrupts (PPIs),
which are usually used to connect local timers to each core. Each CPU
has its own private interface to the GIC, and only sees the PPIs that
are directly connected to it.

While these timers are separate devices and have a separate interrupt
line to a core, they all use the same IRQ number.

For these devices, request_irq() is not the right API as it assumes
that an IRQ number is visible by a number of CPUs (through the
affinity setting), but makes it very awkward to express that an IRQ
number can be handled by all CPUs, and yet be a different interrupt
line on each CPU, requiring a different dev_id cookie to be passed
back to the handler.

The *_percpu_irq() functions are designed to overcome these
limitations, by providing a per-cpu dev_id vector:

int request_percpu_irq(unsigned int irq, irq_handler_t handler,
		   const char *devname, void __percpu *percpu_dev_id);
void free_percpu_irq(unsigned int, void __percpu *);
int setup_percpu_irq(unsigned int irq, struct irqaction *new);
void remove_percpu_irq(unsigned int irq, struct irqaction *act);
void enable_percpu_irq(unsigned int irq);
void disable_percpu_irq(unsigned int irq);

The API has a number of limitations:
- no interrupt sharing
- no threading
- common handler across all the CPUs

Once the interrupt is requested using setup_percpu_irq() or
request_percpu_irq(), it must be enabled by each core that wishes its
local interrupt to be delivered.

Based on an initial patch by Thomas Gleixner.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1316793788-14500-2-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

30 Sep, 2011 1 commit

posix-cpu-timers: Cure SMP wobbles · d670ec13

Committed by Peter Zijlstra on Sep 01, 2011

David reported:

  Attached below is a watered-down version of rt/tst-cpuclock2.c from
  GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
  similar.

  Run it several times, and you will see cases where the main thread
  will measure a process clock difference before and after the nanosleep
  which is smaller than the cpu-burner thread's individual thread clock
  difference.  This doesn't make any sense since the cpu-burner thread
  is part of the top-level process's thread group.

  I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
  64-bit binaries).

  For example:

  [davem@boricha build-x86_64-linux]$ ./test
  process: before(0.001221967) after(0.498624371) diff(497402404)
  thread:  before(0.000081692) after(0.498316431) diff(498234739)
  self:    before(0.001223521) after(0.001240219) diff(16698)
  [davem@boricha build-x86_64-linux]$ 

  The diff of 'process' should always be >= the diff of 'thread'.

  I make sure to wrap the 'thread' clock measurements the most tightly
  around the nanosleep() call, and that the 'process' clock measurements
  are the outer-most ones.

  ---
  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <fcntl.h>
  #include <string.h>
  #include <errno.h>
  #include <pthread.h>

  static pthread_barrier_t barrier;

  static void *chew_cpu(void *arg)
  {
	  pthread_barrier_wait(&barrier);
	  while (1)
		  __asm__ __volatile__("" : : : "memory");
	  return NULL;
  }

  int main(void)
  {
	  clockid_t process_clock, my_thread_clock, th_clock;
	  struct timespec process_before, process_after;
	  struct timespec me_before, me_after;
	  struct timespec th_before, th_after;
	  struct timespec sleeptime;
	  unsigned long diff;
	  pthread_t th;
	  int err;

	  err = clock_getcpuclockid(0, &process_clock);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
	  if (err)
		  return 1;

	  pthread_barrier_init(&barrier, NULL, 2);
	  err = pthread_create(&th, NULL, chew_cpu, NULL);
	  if (err)
		  return 1;

	  err = pthread_getcpuclockid(th, &th_clock);
	  if (err)
		  return 1;

	  pthread_barrier_wait(&barrier);

	  err = clock_gettime(process_clock, &process_before);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_before);
	  if (err)
		  return 1;

	  err = clock_gettime(th_clock, &th_before);
	  if (err)
		  return 1;

	  sleeptime.tv_sec = 0;
	  sleeptime.tv_nsec = 500000000;
	  nanosleep(&sleeptime, NULL);

	  err = clock_gettime(th_clock, &th_after);
	  if (err)
		  return 1;

	  err = clock_gettime(my_thread_clock, &me_after);
	  if (err)
		  return 1;

	  err = clock_gettime(process_clock, &process_after);
	  if (err)
		  return 1;

	  diff = process_after.tv_nsec - process_before.tv_nsec;
	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 process_before.tv_sec, process_before.tv_nsec,
		 process_after.tv_sec, process_after.tv_nsec, diff);
	  diff = th_after.tv_nsec - th_before.tv_nsec;
	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 th_before.tv_sec, th_before.tv_nsec,
		 th_after.tv_sec, th_after.tv_nsec, diff);
	  diff = me_after.tv_nsec - me_before.tv_nsec;
	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
		 me_before.tv_sec, me_before.tv_nsec,
		 me_after.tv_sec, me_after.tv_nsec, diff);

	  return 0;
  }

This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.

This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().

Aside from that, it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
Tested-by: David Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>