Commits · b4bc842802db3314f9a657094da0450a903ea619 · openanolis / cloud-kernel

19 May, 2011 1 commit

module: deal with alignment issues in built-in module versions · b4bc8428

Committed by Dmitry Torokhov on Feb 07, 2011

On m68k the natural alignment is a 2-byte boundary, but we are trying to
align structures in the __modver section on a sizeof(void *) boundary.
This causes trouble when we try to access elements in this section
in array-like fashion when creating "version" attributes for built-in
modules.

Moreover, as DaveM said, we can't reliably put structures into
independent objects, put them into a special section, and then expect
array access over them (via the section boundaries) after linking the
objects together to just "work" due to variable alignment choices in
different situations. The only solution that seems to work reliably
is to make an array of plain pointers to the objects in question and
put those pointers in the special section.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Dmitry Torokhov <dtor@vmware.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

b4bc8428
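
The pointer-array approach described in this commit can be sketched in user space. This is a minimal illustration, not the kernel's implementation: the struct, the macro and the section name `modver_ptrs` are hypothetical, and the `__start_`/`__stop_` symbols rely on GNU ld behaviour on ELF targets (an assumption about the toolchain).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-module version record. */
struct version_attr {
    const char *module_name;
    const char *version;
};

/*
 * Define the object normally and place only a pointer to it in the
 * "modver_ptrs" section. Pointers share one size and alignment, so the
 * linker packs them into a dense array no matter how each object file
 * chose to align the structures themselves.
 */
#define DEFINE_VERSION(sym, name, ver)                                  \
    static const struct version_attr sym = { name, ver };              \
    static const struct version_attr *const sym##_ptr                   \
        __attribute__((used, section("modver_ptrs"))) = &sym

DEFINE_VERSION(demo_a, "demo_a", "1.0");
DEFINE_VERSION(demo_b, "demo_b", "2.3");

/* Assumption: GNU ld emits these bounds for sections named like C identifiers. */
extern const struct version_attr *const __start_modver_ptrs[];
extern const struct version_attr *const __stop_modver_ptrs[];

/* Number of pointers the linker collected in the section. */
size_t version_count(void)
{
    return (size_t)(__stop_modver_ptrs - __start_modver_ptrs);
}

/* Walk the pointer array and return the version string, or NULL. */
const char *version_lookup(const char *module_name)
{
    const struct version_attr *const *p;

    for (p = __start_modver_ptrs; p < __stop_modver_ptrs; p++)
        if (strcmp((*p)->module_name, module_name) == 0)
            return (*p)->version;
    return NULL;
}
```

Iterating over pointers rather than over the structures means the stride is always sizeof(void *), which is the property the commit message relies on.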

17 May, 2011 1 commit

tick: Clear broadcast active bit when switching to oneshot · 07f4beb0

Committed by Thomas Gleixner on May 16, 2011

The first cpu which switches from periodic to oneshot mode switches
also the broadcast device into oneshot mode. The broadcast device
serves as a backup for per cpu timers which stop in deeper
C-states. To avoid starvation of the cpus which might be in idle and
depend on broadcast mode it marks the other cpus as broadcast active
and sets the broadcast expiry value of those cpus to the next tick.

The oneshot mode broadcast bit for the other cpus is sticky and only
gets cleared when those cpus exit idle. If a cpu was not idle when
the bit got set, the bit consequently prevents the broadcast device
from being armed on behalf of that cpu when it enters idle for the
first time after it switched to oneshot mode.

In most cases that goes unnoticed as one of the other cpus has usually
a timer pending which keeps the broadcast device armed with a short
timeout. Now if the only cpu which has a short timer active has the
bit set then the broadcast device will not be armed on behalf of that
cpu and will fire way after the expected timer expiry. In the case of
Christian's bug report it took ~145 seconds which is about half of the
wrap around time of HPET (the limit for that device) due to the fact
that all other cpus had no timers armed which expired before the 145
seconds timeframe.

The solution is simply to clear the broadcast active bit
unconditionally when a cpu switches to oneshot mode after the first
cpu switched the broadcast device over. It's not idle at that point
otherwise it would not be executing that code.

[ I fundamentally hate that broadcast crap. Why the heck thought some
  folks that when going into deep idle it's a brilliant concept to
  switch off the last device which brings the cpu back from that
  state? ]

Thanks to Christian for providing all the valuable debug information!
Reported-and-tested-by: Christian Hoffmann <email@christianhoffmann.info>
Cc: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1105161105170.3078%40ionos%3E
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

07f4beb0
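
The sticky-bit problem from this commit can be modelled with a toy bitmask. Everything here is hypothetical (function names, the single global mask, the `with_fix` switch); it only mirrors the sequence of events in the commit message and is not the kernel's tick-broadcast code.

```c
/* Toy model: one bit per cpu, set while the broadcast device is
 * considered responsible for that cpu. All names are hypothetical. */
static unsigned long demo_broadcast_mask;

/* The first cpu to switch moves the broadcast device to oneshot mode
 * and marks every other cpu as broadcast active. */
void demo_switch_broadcast_to_oneshot(int first_cpu, int nr_cpus)
{
    int cpu;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (cpu != first_cpu)
            demo_broadcast_mask |= 1UL << cpu;
}

/* A later cpu switches itself to oneshot mode. It is executing this
 * code, so it cannot be idle; with the fix its bit is always cleared. */
void demo_cpu_switch_to_oneshot(int cpu, int with_fix)
{
    if (with_fix)
        demo_broadcast_mask &= ~(1UL << cpu);
}

/* Returns 1 if the broadcast device gets armed on behalf of this cpu. */
int demo_enter_idle(int cpu)
{
    if (demo_broadcast_mask & (1UL << cpu))
        return 0;   /* stale bit: arming is skipped */
    demo_broadcast_mask |= 1UL << cpu;
    return 1;
}

/* Leaving idle clears the bit again. */
void demo_exit_idle(int cpu)
{
    demo_broadcast_mask &= ~(1UL << cpu);
}
```

In this model, without the fix the first `demo_enter_idle()` after the switch reports that nothing was armed, which corresponds to the late timer expiry in the bug report.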

14 May, 2011 1 commit

Cache user_ns in struct cred · 47a150ed

Committed by Serge E. Hallyn on May 13, 2011

If !CONFIG_USERNS, have current_user_ns() defined to (&init_user_ns).

Get rid of _current_user_ns.  This requires nsown_capable() to be
defined in capability.c rather than as static inline in capability.h,
so do that.

Request_key needs init_user_ns defined at current_user_ns if
!CONFIG_USERNS, so forward-declare that in cred.h if !CONFIG_USERNS
at current_user_ns() define.

Compile-tested with and without CONFIG_USERNS.
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
[ This makes a huge performance difference for acl_permission_check(),
  up to 30%.  And that is one of the hottest kernel functions for loads
  that are pathname-lookup heavy.  ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

47a150ed

12 May, 2011 3 commits

PM / Hibernate: Fix ioctl SNAPSHOT_S2RAM · 36cb7035

Committed by Rafael J. Wysocki on May 10, 2011

The SNAPSHOT_S2RAM ioctl used for implementing the feature allowing
one to suspend to RAM after creating a hibernation image is currently
broken, because it doesn't clear the "ready" flag in the struct
snapshot_data object handled by it.  As a result, the
SNAPSHOT_UNFREEZE doesn't work correctly after SNAPSHOT_S2RAM has
returned and the user space hibernate task cannot thaw the other
processes as appropriate.  Make SNAPSHOT_S2RAM clear data->ready
to fix this problem.
Tested-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org

36cb7035

PM / Hibernate: Make snapshot_release() restore GFP mask · 9744997a

Committed by Rafael J. Wysocki on May 10, 2011

If the process using the hibernate user space interface closes
/dev/snapshot after creating a hibernation image without thawing
tasks, snapshot_release() should call pm_restore_gfp_mask() to
restore the GFP mask used before the creation of the image.  Make
that happen.
Tested-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org

9744997a

PM: Fix warning in pm_restrict_gfp_mask() during SNAPSHOT_S2RAM ioctl · 87186475

Committed by Rafael J. Wysocki on May 10, 2011

A warning is printed by pm_restrict_gfp_mask() while the
SNAPSHOT_S2RAM ioctl is being executed after creating a hibernation
image, because pm_restrict_gfp_mask() has been called once already
before the image creation and suspend_devices_and_enter() calls it
once again.  This happens after commit 452aa699
(mm/pm: force GFP_NOIO during suspend/hibernation and resume).

To avoid this issue, move pm_restrict_gfp_mask() and
pm_restore_gfp_mask() from suspend_devices_and_enter() to its caller
in kernel/power/suspend.c.
Reported-by: Alexandre Felipe Muller de Souza <alexandrefm@mandriva.com.br>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: stable@kernel.org

87186475

07 May, 2011 1 commit

Regression: partial revert "tracing: Remove lock_depth from event entry" · a3a4a5ac

Committed by Arjan van de Ven on May 05, 2011

This partially reverts commit e6e1e259.

That commit changed the structure layout of the trace structure, which
in turn broke PowerTOP (1.9x generation) quite badly.

I appreciate not wanting to expose the variable in question, and
PowerTOP was not using it, so I've replaced the variable with just a
padding field - that way if in the future a new field is needed it can
just use this padding field.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a3a4a5ac
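
Why a padding field keeps existing tools working can be shown with a small layout check. The structs below are hypothetical and only loosely shaped like a trace record; the sketch assumes a consumer that reads fields at fixed offsets.

```c
#include <stddef.h>

/* Hypothetical record as older tools expect it. */
struct entry_old {
    unsigned short type;
    unsigned char  flags;
    unsigned char  preempt_count;
    int            pid;
    int            lock_depth;   /* the field that was to be dropped */
    int            payload;
};

/* Dropping the field outright shifts everything after it. */
struct entry_dropped {
    unsigned short type;
    unsigned char  flags;
    unsigned char  preempt_count;
    int            pid;
    int            payload;
};

/* A same-sized placeholder preserves every later offset. */
struct entry_padded {
    unsigned short type;
    unsigned char  flags;
    unsigned char  preempt_count;
    int            pid;
    int            padding;
    int            payload;
};

_Static_assert(offsetof(struct entry_padded, payload) ==
               offsetof(struct entry_old, payload),
               "padding keeps the payload offset stable");
_Static_assert(sizeof(struct entry_padded) == sizeof(struct entry_old),
               "padding keeps the record size stable");

/* How far the payload moves when the field is removed without padding. */
size_t payload_offset_shift(void)
{
    return offsetof(struct entry_old, payload) -
           offsetof(struct entry_dropped, payload);
}
```

The static assertions fail at compile time if the padded layout ever drifts from the old one, which is a cheap guard for any structure that user space parses by offset.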

05 May, 2011 1 commit

clocksource: Install completely before selecting · e05b2efb

Committed by john stultz on May 04, 2011

Christian Hoffmann reported that the command line clocksource override
with acpi_pm timer fails:

 Kernel command line: <SNIP> clocksource=acpi_pm
 hpet clockevent registered
 Switching to clocksource hpet
 Override clocksource acpi_pm is not HRT compatible.
 Cannot switch while in HRT/NOHZ mode.

The watchdog code is what enables CLOCK_SOURCE_VALID_FOR_HRES, but we
actually end up selecting the clocksource before we enqueue it into
the watchdog list, so that's why we see the warning and fail to switch
to acpi_pm timer as requested. That's particularly bad when we want to
debug timekeeping related problems in early boot.

Put the selection call last.
Reported-by: Christian Hoffmann <email@christianhoffmann.info>
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: stable@kernel.org # 32...
Link: http://lkml.kernel.org/r/%3C1304558210.2943.24.camel%40work-vm%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

e05b2efb
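
The ordering problem in this commit can be reduced to a few lines. This is a simplified model with hypothetical names: "enqueueing into the watchdog" is represented by setting a flag, and selection refuses any source whose flag is not yet set.

```c
#include <stddef.h>

/* Hypothetical, much-simplified clocksource record. */
struct demo_clocksource {
    const char *name;
    int valid_for_hres;   /* set once the watchdog has taken the source on */
};

static const struct demo_clocksource *demo_current;

/* In this model, enqueueing is what marks the source usable. */
static void demo_enqueue_watchdog(struct demo_clocksource *cs)
{
    cs->valid_for_hres = 1;
}

/* Selection refuses sources that are not yet marked usable. */
static int demo_select(const struct demo_clocksource *cs)
{
    if (!cs->valid_for_hres)
        return -1;
    demo_current = cs;
    return 0;
}

/* Old order: select first, so the requested override is rejected. */
int demo_register_select_first(struct demo_clocksource *cs)
{
    int ret = demo_select(cs);

    demo_enqueue_watchdog(cs);
    return ret;
}

/* Fixed order: finish installation, then select. */
int demo_register_select_last(struct demo_clocksource *cs)
{
    demo_enqueue_watchdog(cs);
    return demo_select(cs);
}

/* Name of the currently selected source, or NULL if none was selected. */
const char *demo_current_name(void)
{
    return demo_current ? demo_current->name : NULL;
}
```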

03 May, 2011 1 commit

genirq: Fix typo CONFIG_GENIRC_IRQ_SHOW_LEVEL · 94b2c363

Committed by Geert Uytterhoeven on Apr 30, 2011

commit ab7798ff ("genirq: Expand generic
show_interrupts()") added the Kconfig option GENERIC_IRQ_SHOW_LEVEL to
accommodate PowerPC, but this doesn't actually enable the functionality due
to a typo in the #ifdef check.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linux/PPC Development <linuxppc-dev@lists.ozlabs.org>
Link: http://lkml.kernel.org/r/%3Calpine.DEB.2.00.1104302251370.19068%40ayla.of.borg%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

94b2c363
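
A misspelled macro in an #ifdef is easy to miss because the preprocessor treats an unknown name as simply undefined. The sketch below is a stand-alone illustration; the `#define` stands in for what the build configuration would provide, and the function names are hypothetical.

```c
/* Stand-in for what the build configuration would define when the
 * option is enabled. */
#define CONFIG_GENERIC_IRQ_SHOW_LEVEL 1

/* Misspelled guard: the name is never defined, so the first branch is
 * silently compiled out and the compiler reports nothing. */
int level_shown_with_typo(void)
{
#ifdef CONFIG_GENIRC_IRQ_SHOW_LEVEL
    return 1;
#else
    return 0;
#endif
}

/* Correctly spelled guard: the first branch is compiled in. */
int level_shown_fixed(void)
{
#ifdef CONFIG_GENERIC_IRQ_SHOW_LEVEL
    return 1;
#else
    return 0;
#endif
}
```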

30 Apr, 2011 1 commit

workqueue: fix deadlock in worker_maybe_bind_and_lock() · 5035b20f

Committed by Tejun Heo on Apr 29, 2011

If a rescuer and stop_machine() bringing down a CPU race with each
other, they may deadlock on a non-preemptive kernel.  The CPU won't
accept a new task, so the rescuer can't migrate to the target CPU,
while stop_machine() can't proceed because the rescuer is occupying one
of the CPUs while retrying migration.  GCWQ_DISASSOCIATED is never cleared
and worker_maybe_bind_and_lock() retries indefinitely.

This problem can be reproduced semi-reliably while the system is
entering suspend.

 http://thread.gmane.org/gmane.linux.kernel/1122051

A lot of kudos to Thilo-Alexander for reporting this tricky issue and
painstaking testing.

stable: This affects all kernels with cmwq, so all kernels since and
        including v2.6.36 need this fix.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com>
Cc: stable@kernel.org

5035b20f

29 Apr, 2011 2 commits

hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically · ce31332d

Committed by Thomas Gleixner on Apr 29, 2011

Sedat and Bruno reported RCU stalls which turned out to be caused by
the following:

sched_init() calls init_rt_bandwidth() which calls hrtimer_init()
_BEFORE_ hrtimers_init() is called. While not entirely correct this
worked because hrtimer_init() only accessed statically initialized
data (hrtimer_bases.clock_base[CLOCK_MONOTONIC])

Commit e06383db (hrtimers: extend hrtimer base code to handle more
then 2 clockids) added an indirection to the hrtimer_bases.clock_base
lookup to avoid gap handling in the hot path. The table which is used
for the translation from CLOCK_ID to HRTIMER_BASE index is
initialized at runtime in hrtimers_init(). So the early call of the
scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME.

Thus the rt_bandwidth timer ends up on CLOCK_REALTIME. If the timer is
armed and the wall clock time is set (e.g. ntpdate in the early boot
process - which also gives the problem deterministic behaviour
i.e. magic recovery after N hours), then the timer ends up with an
expiry time far into the future. That breaks the RT throttler
mechanism as rt runtime is accumulated and never cleared, so the rt
throttler detects a false cpu hog condition and blocks all RT tasks
until the timer finally expires. That in turn stalls the RCU thread of
TINYRCU which leads to a huge number of RCU callbacks piling up.

Make the translation table statically initialized, so we are back to
the status of <= 2.6.39.
Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reported-by: Bruno Prémont <bonbons@linux-vserver.org>
Cc: John stultz <johnstul@us.ibm.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3E
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

ce31332d
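
The difference between a runtime-filled and a statically initialized lookup table is sketched below. The enum values and names are hypothetical and chosen only to reproduce the failure mode in the commit message: before the late initializer runs, a zero-filled table maps every clock id to base 0.

```c
/* Hypothetical clock ids and base indexes; the gap before BOOTTIME
 * mimics a sparse id space that needs a translation table. */
enum demo_clock {
    DEMO_CLOCK_REALTIME  = 0,
    DEMO_CLOCK_MONOTONIC = 1,
    DEMO_CLOCK_BOOTTIME  = 7,
    DEMO_MAX_CLOCKS      = 16
};

enum demo_base {
    DEMO_BASE_REALTIME = 0,
    DEMO_BASE_MONOTONIC,
    DEMO_BASE_BOOTTIME
};

/* Runtime-filled table: all zeros until demo_late_init() runs, so every
 * clock id silently maps to base 0 (REALTIME) for early callers. */
static int demo_runtime_table[DEMO_MAX_CLOCKS];

void demo_late_init(void)
{
    demo_runtime_table[DEMO_CLOCK_REALTIME]  = DEMO_BASE_REALTIME;
    demo_runtime_table[DEMO_CLOCK_MONOTONIC] = DEMO_BASE_MONOTONIC;
    demo_runtime_table[DEMO_CLOCK_BOOTTIME]  = DEMO_BASE_BOOTTIME;
}

/* Statically initialized table: correct before any init code runs. */
static const int demo_static_table[DEMO_MAX_CLOCKS] = {
    [DEMO_CLOCK_REALTIME]  = DEMO_BASE_REALTIME,
    [DEMO_CLOCK_MONOTONIC] = DEMO_BASE_MONOTONIC,
    [DEMO_CLOCK_BOOTTIME]  = DEMO_BASE_BOOTTIME,
};

int demo_runtime_lookup(int clock_id)
{
    return demo_runtime_table[clock_id];
}

int demo_static_lookup(int clock_id)
{
    return demo_static_table[clock_id];
}
```

Designated initializers keep the static table readable even with gaps in the id space, and they remove the dependency on initialization order.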

kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog · 1409f141

Committed by Hillf Danton on Apr 27, 2011

In corner cases where the softlockup watchdog is not set up successfully, the
relevant NMI perf event for the hardlockup watchdog can be disabled, so that
the status of the underlying hardware remains unchanged.

Also, if the kthread doesn't start then the hrtimer won't run and the
hardlockup detector will falsely fire.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1409f141

25 Apr, 2011 1 commit

ptrace: Prepare to fix racy accesses on task breakpoints · bf26c018

Committed by Frederic Weisbecker on Apr 07, 2011

When a task is traced and is in a stopped state, the tracer
may execute a ptrace request to examine the tracee state and
get its task struct. Right after, the tracee can be killed
and thus its breakpoints released.
This can happen concurrently when the tracer is in the middle
of reading or modifying these breakpoints, leading to dereferencing
a freed pointer.

Hence, to prepare the fix, create a generic breakpoint reference
holding API. When a reference on the breakpoints of a task is
held, the breakpoints won't be released until the last reference
is dropped. After that, no more ptrace requests on the task's
breakpoints can be serviced for the tracer.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: v2.6.33.. <stable@kernel.org>
Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com

bf26c018

21 Apr, 2011 1 commit

ftrace: Build without frame pointers on Microblaze · d20ac252

Committed by Michal Simek on Apr 04, 2011

Microblaze doesn't need/support FRAME_POINTERS in order to have a working
function tracer.

The patch removes the Kconfig warning.

Warning log:
warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP &&
FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct
dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 ||
SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS)
Signed-off-by: Michal Simek <monstr@monstr.eu>
Link: http://lkml.kernel.org/r/1301908812-8119-2-git-send-email-monstr@monstr.eu
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

d20ac252

20 Apr, 2011 1 commit

PM: Add missing syscore_suspend() and syscore_resume() calls · 19234c08

Committed by Rafael J. Wysocki on Apr 20, 2011

Device suspend/resume infrastructure is used not only by the suspend
and hibernate code in kernel/power, but also by APM, Xen and the
kexec jump feature.  However, commit 40dc166c
(PM / Core: Introduce struct syscore_ops for core subsystems PM)
failed to add syscore_suspend() and syscore_resume() calls to that
code, which generally leads to breakage when the features in question
are used.

To fix this problem, add the missing syscore_suspend() and
syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c
and drivers/xen/manage.c.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

19234c08

19 Apr, 2011 2 commits

PM: Fix error code paths executed after failing syscore_suspend() · 2ca6f62f

Committed by Rafael J. Wysocki on Apr 18, 2011

If syscore_suspend() fails in suspend_enter(), create_image() or
resume_target_kernel(), it is necessary to call sysdev_resume(),
because sysdev_suspend() has been called already and succeeded
and we are going to abort the transition.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>

2ca6f62f

next_pidmap: fix overflow condition · c78193e9

Committed by Linus Torvalds on Apr 18, 2011

next_pidmap() just quietly accepted whatever 'last' pid that was passed
in, which is not all that safe when one of the users is /proc.

Admittedly the proc code should do some sanity checking on the range
(and that will be the next commit), but that doesn't mean that the
helper functions should just do that pidmap pointer arithmetic without
checking the range of its arguments.

So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
doesn't really matter, the for-loop does check against the end of the
pidmap array properly (it's only the actual pointer arithmetic overflow
case we need to worry about, and going one bit beyond isn't going to
overflow).

[ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Analyzed-by: Robert Święcki <robert@swiecki.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c78193e9
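
The general idea of bounding an externally supplied value before it is used for indexing can be sketched as follows. This is not the kernel's next_pidmap(): the bitmap, the limit and the function names are hypothetical, and here an out-of-range 'last' is simply rejected, which is one possible way to bound it.

```c
#include <limits.h>

#define DEMO_PID_MAX_LIMIT 1024u   /* hypothetical upper bound */
#define DEMO_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_WORDS         (DEMO_PID_MAX_LIMIT / DEMO_BITS_PER_WORD)

/* One bit per pid; a set bit means the pid is in use. */
static unsigned long demo_pidmap[DEMO_WORDS];

void demo_mark_pid(unsigned int pid)
{
    if (pid < DEMO_PID_MAX_LIMIT)
        demo_pidmap[pid / DEMO_BITS_PER_WORD] |=
            1UL << (pid % DEMO_BITS_PER_WORD);
}

/* Return the next pid in use after 'last', or -1 if there is none. */
int demo_next_pid(unsigned int last)
{
    unsigned int pid;

    /* Check the range before 'last' is ever turned into an index. */
    if (last >= DEMO_PID_MAX_LIMIT)
        return -1;

    for (pid = last + 1; pid < DEMO_PID_MAX_LIMIT; pid++)
        if (demo_pidmap[pid / DEMO_BITS_PER_WORD] &
            (1UL << (pid % DEMO_BITS_PER_WORD)))
            return (int)pid;
    return -1;
}
```

As in the commit message, "last + 1" needs no separate guard once 'last' itself is bounded, because the loop condition already checks against the end of the map.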

18 Apr, 2011 1 commit

posix clocks: Replace mutex with reader/writer semaphore · 1791f881

Committed by Richard Cochran on Mar 30, 2011

A dynamic posix clock is protected from asynchronous removal by a mutex.
However, using a mutex has the unwanted effect that a long running clock
operation in one process will unnecessarily block other processes.

For example, one process might call read() to get an external time stamp
coming in at one pulse per second. A second process calling clock_gettime
would have to wait for almost a whole second.

This patch fixes the issue by using a reader/writer semaphore instead of
a mutex.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1791f881

16 Apr, 2011 2 commits

block: make unplug timer trace event correspond to the schedule() unplug · 49cac01e

Committed by Jens Axboe on Apr 16, 2011

It's a pretty close match to what we had before - the timer triggering
would mean that nobody unplugged the plug in due time, in the new
scheme this matches very closely what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
or an implicit unplug (timer unplug, we scheduled with pending IO
queued).
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

49cac01e

block: let io_schedule() flush the plug inline · a237c1c5

Committed by Jens Axboe on Apr 16, 2011

Linus correctly observes that the most important dispatch cases
are now done from kblockd, which isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ letting the "accidental"
flush directly in schedule() be guarded by offload to kblockd,
we should be able to get the best of both worlds.

So add a blk_schedule_flush_plug() that offloads to kblockd,
and only use that from the schedule() path.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

a237c1c5

15 Apr, 2011 1 commit

futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup · 0cd9c649

Committed by Darren Hart on Apr 14, 2011

The FLAGS_HAS_TIMEOUT flag was not getting set, causing the restart_block to
restart futex_wait() without a timeout after a signal.

Commit b41277dc in 2.6.38 introduced the regression by accidentally
removing the FLAGS_HAS_TIMEOUT assignment from futex_wait() during the setup
of the restart block. Restore the original behavior.

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=32922
Reported-by: Tim Smith <tsmith201104@yahoo.com>
Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Kacur <jkacur@redhat.com>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/%3Cdaac0eb3af607f72b9a4d3126b2ba8fb5ed3b883.1302820917.git.dvhart%40linux.intel.com%3E
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

0cd9c649

13 Apr, 2011 1 commit

block: don't flush plugged IO on forced preemtion scheduling · 6631e635

Committed by Linus Torvalds on Apr 13, 2011

We really only want to unplug the pending IO when the process actually
goes to sleep.  So move the test for flushing the plug up to the place
where we actually deactivate the task - where we have properly checked
for preemption and for the process really sleeping.
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6631e635

12 Apr, 2011 4 commits

block: fixup block IO unplug trace call · 94b5eb28

Committed by Jens Axboe on Apr 12, 2011

It was removed with the on-stack plugging; re-add it and track the
depth of requests added when flushing the plug.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

94b5eb28

block: remove block_unplug_timer() trace point · d9c97833

Committed by Jens Axboe on Apr 12, 2011

We no longer have an unplug timer running, so no point in keeping
the trace point.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

d9c97833

fix XEN_SAVE_RESTORE Kconfig dependencies · d419e4c0

Committed by Shriram Rajagopalan on Apr 11, 2011

Make XEN_SAVE_RESTORE select HIBERNATE_CALLBACKS.
Remove XEN_SAVE_RESTORE dependency from PM_SLEEP.
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>

d419e4c0

PM / Hibernate: Introduce CONFIG_HIBERNATE_CALLBACKS · 1f112cee

Committed by Rafael J. Wysocki on Apr 11, 2011

Xen save/restore is going to use hibernate device callbacks for
quiescing devices and putting them back to normal operations and it
would need to select CONFIG_HIBERNATION for this purpose.  However,
that also would cause the hibernate interfaces for user space to be
enabled, which might confuse user space, because the Xen kernels
don't support hibernation.  Moreover, it would be wasteful, as it
would make the Xen kernels include a substantial amount of code that
they would never use.

To address this issue introduce new power management Kconfig option
CONFIG_HIBERNATE_CALLBACKS, such that it will only select the code
that is necessary for the hibernate device callbacks to work and make
CONFIG_HIBERNATION select it.  Then, Xen save/restore will be able to
select CONFIG_HIBERNATE_CALLBACKS without dragging the entire
hibernate code along with it.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Shriram Rajagopalan <rshriram@cs.ubc.ca>

1f112cee

11 Apr, 2011 3 commits

sched: Fix erroneous all_pinned logic · b30aef17

Committed by Ken Chen on Apr 08, 2011

The scheduler load balancer has specific code to deal with cases of
an unbalanced system due to lots of unmovable tasks (for example because of
hard CPU affinity). In those situations, it excludes the busiest CPU that
has pinned tasks from load balance consideration such that it can perform
a second load balance pass on the rest of the system.

This all works as designed if there is only one cgroup in the system.

However, when we have multiple cgroups, this logic has false positives and
triggers multiple load balance passes even though there are actually no
pinned tasks at all.

The reason it has false positives is that the all pinned logic is deep in
the lowest function of can_migrate_task() and is too low level:

load_balance_fair() iterates each task group and calls balance_tasks() to
migrate target load. Along the way, balance_tasks() will also set an
all_pinned variable. Given that task-groups are iterated, this all_pinned
variable is essentially the status of the last group in the scanning process.
A task group can have a number of reasons for no load being migrated, none
of them due to cpu affinity. However, this status bit is being propagated
back up to the higher level load_balance(), which incorrectly thinks that
no tasks were moved. It kicks off the all-pinned logic and starts multiple
passes attempting to move load onto the puller CPU.

To fix this, move the all_pinned aggregation up to the iterator level.
This ensures that the status is aggregated over all task-groups, not just
the last one in the list.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

b30aef17
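
The "last group wins" problem and the aggregated alternative can be contrasted in a small model. The struct and functions are hypothetical and much simpler than the scheduler's code; in particular, a group with nothing to move is modelled as reporting "all pinned", standing in for the non-affinity reasons mentioned in the commit message.

```c
#include <stddef.h>

/* Hypothetical per-group summary. */
struct demo_group {
    int nr_tasks;
    int nr_pinned;   /* tasks that cannot move because of CPU affinity */
};

/* A group reports "all pinned" when it found no movable task. */
static int demo_group_all_pinned(const struct demo_group *g)
{
    return g->nr_pinned == g->nr_tasks;
}

/* Flawed variant: every group overwrites the flag, so the caller only
 * ever sees the status of the last group scanned. */
int demo_all_pinned_last_wins(const struct demo_group *groups, size_t n)
{
    int all_pinned = 0;
    size_t i;

    for (i = 0; i < n; i++)
        all_pinned = demo_group_all_pinned(&groups[i]);
    return all_pinned;
}

/* Aggregated variant: start from "everything pinned" and clear the flag
 * as soon as any group contains a movable task. */
int demo_all_pinned_aggregated(const struct demo_group *groups, size_t n)
{
    int all_pinned = 1;
    size_t i;

    for (i = 0; i < n; i++)
        if (!demo_group_all_pinned(&groups[i]))
            all_pinned = 0;
    return all_pinned;
}
```

With one group of four movable tasks followed by an empty group, the flawed variant reports "all pinned" while the aggregated variant does not.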

sched: Fix sched-domain avg_load calculation · b0432d8f

Committed by Ken Chen on Apr 07, 2011

In function find_busiest_group(), the sched-domain avg_load isn't
calculated at all if there is a group imbalance within the domain. This
will cause erroneous imbalance calculation.

The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
will dump entire sds->max_load into imbalance variable, which is used
later on to migrate entire load from busiest CPU to the puller CPU.

This has two really bad effects:

1. stampede of task migration, and they won't be able to break out
   of the bad state because of positive feedback loop: large load
   delta -> heavier load migration -> larger imbalance and the cycle
   goes on.

2. severe imbalance in CPU queue depth.  This causes really long
   scheduling latency blips, which badly affect applications that
   have tight latency requirements.

The fix is to have the kernel calculate domain avg_load in both cases. This
will ensure that imbalance calculation is always sensible and the target
is usually half way between busiest and puller CPU.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

b0432d8f

perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() · e566b76e

Committed by Stephane Eranian on Apr 06, 2011

There is a bug in perf_event_enable_on_exec() when cgroup events are
active on a CPU: the cgroup events may be scheduled twice causing event
state corruptions which eventually may lead to kernel panics.

The reason is that the function needs to first schedule out the cgroup
events, just like for the per-thread events. The cgroup events are
scheduled back in automatically from the perf_event_context_sched_in()
function.

The patch also adds a WARN_ON_ONCE() in perf_cgroup_switch() to catch any
bogus state.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e566b76e

09 Apr, 2011 1 commit

signal.c: fix erroneous syscall kernel-doc · f9fa0bc1

Committed by Randy Dunlap on Apr 08, 2011

Fix erroneous syscall kernel-doc comments in kernel/signal.c.
Reported-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f9fa0bc1

05 Apr, 2011 3 commits

sched: Clean up rebalance_domains() load-balance interval calculation · 49c022e6

Committed by Peter Zijlstra on Apr 05, 2011

Instead of the possible multiple-evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case since it's implemented with a Hamming weight
function over a cpu bitmask which can be darn expensive for those
with big iron.

This also makes it cleaner, smaller and documents the code.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

49c022e6

kernel/signal.c: add kernel-doc notation to syscalls · 41c57892

Committed by Randy Dunlap on Apr 04, 2011

Add kernel-doc to syscalls in signal.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

41c57892

kernel/signal.c: fix typos and coding style · 5aba085e

Committed by Randy Dunlap on Apr 04, 2011

General coding style and comment fixes; no code changes:

 - Use multi-line-comment coding style.
 - Put some function signatures completely on one line.
 - Hyphenate some words.
 - Spell Posix as POSIX.
 - Correct typos & spellos in some comments.
 - Drop trailing whitespace.
 - End sentences with periods.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5aba085e

04 Apr, 2011 1 commit

ntp: fix non privileged system time shifting · 4352d9d4

Committed by Richard Cochran on Apr 04, 2011

The ADJ_SETOFFSET bit added in commit 094aa188 ("ntp: Add ADJ_SETOFFSET
mode bit") also introduced a way for any user to change the system time.
Sneaky or buggy calls to adjtimex() could set

    ADJ_OFFSET_SS_READ | ADJ_SETOFFSET

which would result in a successful call to timekeeping_inject_offset().
This patch fixes the issue by adding the capability check.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4352d9d4
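
How a mode combination can slip past permission checks is sketched below. This is a simplified model, not the kernel's adjtimex code: the surrounding checks are an assumption about the general shape of the logic, `privileged` stands in for a capability check, and the DEMO_ constants are local copies of the mode bit values as found in the kernel's timex header.

```c
#include <errno.h>

/* Local copies of the mode bits (values as in the kernel's timex header). */
#define DEMO_ADJ_SETOFFSET       0x0100
#define DEMO_ADJ_OFFSET_READONLY 0x2000
#define DEMO_ADJ_ADJTIME         0x8000
#define DEMO_ADJ_OFFSET_SS_READ  0xa001

/*
 * Returns 0 if the request would be allowed, -EPERM otherwise.
 * 'check_setoffset' selects whether the added guard is present.
 */
int demo_adjtimex_allowed(unsigned int modes, int privileged,
                          int check_setoffset)
{
    if (modes & DEMO_ADJ_ADJTIME) {
        /* Read-only adjtime-style queries are open to everyone. */
        if (!(modes & DEMO_ADJ_OFFSET_READONLY) && !privileged)
            return -EPERM;
    } else if (modes && !privileged) {
        /* Any other modification needs privilege. */
        return -EPERM;
    }

    /* The added guard: injecting an offset always needs privilege. */
    if ((modes & DEMO_ADJ_SETOFFSET) && check_setoffset && !privileged)
        return -EPERM;

    return 0;
}
```

In this model, DEMO_ADJ_OFFSET_SS_READ | DEMO_ADJ_SETOFFSET takes the read-only branch and is accepted for an unprivileged caller unless the extra guard is enabled.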

03 Apr, 2011 1 commit

genirq: Fix cpumask leak in __setup_irq() · 4f5058c3

Committed by Xiaotian Feng on Apr 02, 2011

The allocated cpumask should be freed in __setup_irq().
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
LKML-Reference: <1301744375-6812-1-git-send-email-dfeng@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

4f5058c3

01 Apr, 2011 1 commit

kdump: Allow shrinking of kdump region to be overridden · c0bb9e45

Committed by Anton Blanchard on Aug 25, 2010

On ppc64 the crashkernel region almost always overlaps an area of firmware.
This works fine except when using the sysfs interface to reduce the kdump
region. If we free the firmware area we are guaranteed to crash.

Rename free_reserved_phys_range to crash_free_reserved_phys_range and make
it a weak function so we can override it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

c0bb9e45

31 Mar, 2011 4 commits

Fix common misspellings · 25985edc

Committed by Lucas De Marchi on Mar 30, 2011

Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>

25985edc

perf: Fix task_struct reference leak · fd1edb3a

Committed by Peter Zijlstra on Mar 28, 2011

sys_perf_event_open() had an imbalance in the number of task refs it
took, causing memory leakage.

Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org # .37+
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

fd1edb3a

perf: Rebase max unprivileged mlock threshold on top of page size · 20443384

Committed by Frederic Weisbecker on Mar 31, 2011

Ensure we allow 512 kiB + 1 page for user control without
assuming a 4096-byte page size.
Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: <stable@kernel.org>
LKML-Reference: <1301535209-9679-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

20443384
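
The arithmetic behind "512 kiB + 1 page" can be written out directly. The function name is hypothetical; the sketch only shows how the allowance follows the page size instead of assuming 4096 bytes.

```c
/* Allowance in KiB for a given page size in bytes: a fixed 512 KiB
 * plus exactly one page, whatever the page size happens to be. */
long demo_mlock_limit_kib(long page_size_bytes)
{
    return 512 + page_size_bytes / 1024;
}
```

With 4096-byte pages this gives 516 KiB, and with 65536-byte pages 576 KiB. In user space the page size could be obtained with sysconf(_SC_PAGESIZE).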

sched: Fix rebalance interval calculation · 3436ae12

Committed by Sisir Koppaka on Mar 26, 2011

The interval for checking scheduling domains if they are due to be
balanced currently depends on boot state NR_CPUS, which may not
accurately reflect the number of online CPUs at the time of check.

Thus replace NR_CPUS with num_online_cpus().

 (ed: Should only affect those who set NR_CPUS really high, such as 4096
      or so :-)
Signed-off-by: Sisir Koppaka <sisir.koppaka@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTikqHWid2Q93F5U5Qw5snJH8C5PXoa7J6=6hYO94@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

3436ae12

openanolis / cloud-kernel synced successfully more than 1 year ago
