提交 · 77f827de07432a74821cf0f831d699544b2d474f · openanolis / cloud-kernel

04 9月, 2012 1 次提交

PM / Domains: Add power off/on function for system core suspend stage · 77f827de

由 Rafael J. Wysocki 提交于 8月 06, 2012

Introduce function pm_genpd_syscore_switch() and two wrappers around
it, pm_genpd_syscore_poweroff() and pm_genpd_syscore_poweron(),
allowing the callers to let the generic PM domains framework know
that the given device is not necessary any more and its PM domain
can be turned off (the former) or that the given device will be
required immediately, so its PM domain has to be turned on (the
latter) during the system core (syscore) stage of system suspend
(or hibernation) and resume.

These functions will be used for handling devices registered as
clock sources and clock event devices that belong to PM domains.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

77f827de

02 9月, 2012 1 次提交

time: Move ktime_t overflow checking into timespec_valid_strict · cee58483

由 John Stultz 提交于 8月 31, 2012

Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b1452 ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.

Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.

This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.
Reported-and-tested-by: NAndreas Bombe <aeb@debian.org>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cee58483

22 8月, 2012 6 次提交

task_work: add a scheduling point in task_work_run() · 88ec2789

由 Eric Dumazet 提交于 8月 21, 2012

It seems commit 4a9d4b02 (switch fput to task_work_add) reintroduced
the problem addressed in commit 944be0b2 (close_files(): add scheduling
point)

If a server process with a lot of files (say 2 million tcp sockets)
is killed, we can spend a lot of time in task_work_run() and trigger
a soft lockup.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

88ec2789

time: Avoid making adjustments if we haven't accumulated anything · bf2ac312

由 John Stultz 提交于 8月 21, 2012

If update_wall_time() is called and the current offset isn't large
enough to accumulate, avoid re-calling timekeeping_adjust which may
change the clock freq and can cause 1ns inconsistencies with
CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

bf2ac312

time: Avoid potential shift overflow with large shift values · 6ea565a9

由 John Stultz 提交于 8月 21, 2012

Andreas Schwab noticed that the 1 << tk->shift could overflow if the
shift value was greater than 30, since 1 would be a 32bit long on
32bit architectures. This issue was introduced by 1e75fa8b (time:
Condense timekeeper.xtime into xtime_sec)

Use 1ULL instead to ensure we don't overflow on the shift.
Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-4-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

6ea565a9

time: Fix casting issue in timekeeping_forward_now · 85dc8f05

由 Andreas Schwab 提交于 8月 21, 2012

arch_gettimeoffset returns a u32 value which when shifted by tk->shift
can overflow. This issue was introduced with 1e75fa8b (time: Condense
timekeeper.xtime into xtime_sec)

Cast it to u64 first.
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-3-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

85dc8f05

time: Ensure we normalize the timekeeper in tk_xtime_add · 784ffcbb

由 John Stultz 提交于 8月 21, 2012

Andreas noticed problems with resume on specific hardware after commit
1e75fa8b (time: Condense timekeeper.xtime into xtime_sec) combined
with commit b44d50dc (time: Fix casting issue in tk_set_xtime and
tk_xtime_add)

After some digging I realized we aren't normalizing the timekeeper
after the add. Add the missing normalize call.
Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
Tested-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/1345595449-34965-2-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

784ffcbb

task_work: add a scheduling point in task_work_run() · f341861f

由 Eric Dumazet 提交于 8月 21, 2012

It seems commit 4a9d4b02 ("switch fput to task_work_add") re-
introduced the problem addressed in 944be0b2 ("close_files(): add
scheduling point")

If a server process with a lot of files (say 2 million tcp sockets) is
killed, we can spend a lot of time in task_work_run() and trigger a soft
lockup.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f341861f

21 8月, 2012 1 次提交

uprobes: Fix mmap_region()'s mm->mm_rb corruption if uprobe_mmap() fails · c7a3a88c

由 Oleg Nesterov 提交于 8月 19, 2012

This patch fixes:

  https://bugzilla.redhat.com/show_bug.cgi?id=843640

If mmap_region()->uprobe_mmap() fails, unmap_and_free_vma path
does unmap_region() but does not remove the soon-to-be-freed vma
from rb tree. Actually there are more problems but this is how
William noticed this bug.

Perhaps we could do do_munmap() + return in this case, but in
fact it is simply wrong to abort if uprobe_mmap() fails. Until
at least we move the !UPROBE_COPY_INSN code from
install_breakpoint() to uprobe_register().

For example, uprobe_mmap()->install_breakpoint() can fail if the
probed insn is not supported (remember, uprobe_register()
succeeds if nobody mmaps inode/offset), mmap() should not fail
in this case.

dup_mmap()->uprobe_mmap() is wrong too by the same reason,
fork() can race with uprobe_register() and fail for no reason if
it wins the race and does install_breakpoint() first.

And, if nothing else, both mmap_region() and dup_mmap() return
success if uprobe_mmap() fails. Change them to ignore the error
code from uprobe_mmap().
Reported-and-tested-by: NWilliam Cohen <wcohen@redhat.com>
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> # v3.5
Cc: Anton Arapov <anton@redhat.com>
Cc: William Cohen <wcohen@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120819171042.GB26957@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

c7a3a88c

19 8月, 2012 1 次提交

alpha: take a bunch of syscalls into osf_sys.c · be53db6e

由 Al Viro 提交于 8月 19, 2012

New helper: current_thread_info().  Allows to do a bunch of odd syscalls
in C. While we are at it, there had never been a reason to do
osf_getpriority() in assembler.  We also get "namespace"-aware (read:
consistent with getuid(2), etc.) behaviour from getx?id() syscalls now.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NMichael Cree <mcree@orcon.net.nz>
Acked-by: NMatt Turner <mattst88@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

be53db6e

18 8月, 2012 1 次提交

tracing/syscalls: Fix perf syscall tracing when syscall_nr == -1 · 60916a93

由 Will Deacon 提交于 8月 16, 2012

syscall_get_nr can return -1 in the case that the task is not executing
a system call.

This patch fixes perf_syscall_{enter,exit} to check that the syscall
number is valid before using it as an index into a bitmap.

Link: http://lkml.kernel.org/r/1345137254-7377-1-git-send-email-will.deacon@arm.com

Cc: Jason Baron <jbaron@redhat.com>
Cc: Wade Farnsworth <wade_farnsworth@mentor.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

60916a93

15 8月, 2012 4 次提交

time: Improve sanity checking of timekeeping inputs · 4e8b1452

由 John Stultz 提交于 8月 08, 2012

Unexpected behavior could occur if the time is set to a value large
enough to overflow a 64bit ktime_t (which is something larger then the
year 2262).

Also unexpected behavior could occur if large negative offsets are
injected via adjtimex.

So this patch improves the sanity check timekeeping inputs by
improving the timespec_valid() check, and then makes better use of
timespec_valid() to make sure we don't set the time to an invalid
negative value or one that overflows ktime_t.

Note: This does not protect from setting the time close to overflowing
ktime_t and then letting natural accumulation cause the overflow.
Reported-by: NCAI Qian <caiqian@redhat.com>
Reported-by: NSasha Levin <levinsasha928@gmail.com>
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

4e8b1452

audit: clean up refcounting in audit-tree · b3e8692b

由 Miklos Szeredi 提交于 8月 15, 2012

Drop the initial reference by fsnotify_init_mark early instead of
audit_tree_freeing_mark() at destroy time.

In the cases we destroy the mark before we drop the initial reference we need to
get rid of the get_mark that balances the put_mark in audit_tree_freeing_mark().
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

b3e8692b

audit: fix refcounting in audit-tree · a2140fc0

由 Miklos Szeredi 提交于 8月 15, 2012

Refcounting of fsnotify_mark in audit tree is broken.  E.g:

                              refcount
create_chunk
  alloc_chunk                 1
  fsnotify_add_mark           2

untag_chunk
  fsnotify_get_mark           3
  fsnotify_destroy_mark
    audit_tree_freeing_mark   2
  fsnotify_put_mark           1
  fsnotify_put_mark           0
  via destroy_list
    fsnotify_mark_destroy    -1

This was reported by various people as triggering Oops when stopping auditd.

We could just remove the put_mark from audit_tree_freeing_mark() but that would
break freeing via inode destruction.  So this patch simply omits a put_mark
after calling destroy_mark or adds a get_mark before.

The additional get_mark is necessary where there's no other put_mark after
fsnotify_destroy_mark() since it assumes that the caller is holding a reference
(or the inode is keeping the mark pinned, not the case here AFAICS).
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Reported-by: NValentin Avram <aval13@gmail.com>
Reported-by: NPeter Moody <pmoody@google.com>
Acked-by: NEric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org

a2140fc0

audit: don't free_chunk() after fsnotify_add_mark() · 0fe33aae

由 Miklos Szeredi 提交于 8月 15, 2012

Don't do free_chunk() after fsnotify_add_mark().  That one does a delayed unref
via the destroy list and this results in use-after-free.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Acked-by: NEric Paris <eparis@redhat.com>
CC: stable@vger.kernel.org

0fe33aae

14 8月, 2012 5 次提交

sched: Fix migration thread runtime bogosity · 8f618968

由 Mike Galbraith 提交于 8月 04, 2012

Make stop scheduler class do the same accounting as other classes,

Migration threads can be caught in the act while doing exec balancing,
leading to the below due to use of unmaintained ->se.exec_start.  The
load that triggered this particular instance was an apparently out of
control heavily threaded application that does system monitoring in
what equated to an exec bomb, with one of the VERY frequently migrated
tasks being ps.

%CPU   PID USER     CMD
99.3    45 root     [migration/10]
97.7    53 root     [migration/12]
97.0    57 root     [migration/13]
90.1    49 root     [migration/11]
89.6    65 root     [migration/15]
88.7    17 root     [migration/3]
80.4    37 root     [migration/8]
78.1    41 root     [migration/9]
44.2    13 root     [migration/2]
Signed-off-by: NMike Galbraith <mgalbraith@suse.de>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

8f618968

sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · e221d028

由 Mike Galbraith 提交于 8月 07, 2012

Root task group bandwidth replenishment must service all CPUs, regardless of
where the timer was last started, and regardless of the isolation mechanism,
lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
Signed-off-by: NMike Galbraith <efault@gmx.de>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

e221d028

sched,cgroup: Fix up task_groups list · 35cf4e50

由 Mike Galbraith 提交于 8月 07, 2012

With multiple instances of task_groups, for_each_rt_rq() is a noop,
no task groups having been added to the rt.c list instance.  This
renders __enable/disable_runtime() and print_rt_stats() noop, the
user (non) visible effect being that rt task groups are missing in
/proc/sched_debug.
Signed-off-by: NMike Galbraith <efault@gmx.de>
Cc: stable@kernel.org # v3.3+
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

35cf4e50

sched: fix divide by zero at {thread_group,task}_times · bea6832c

由 Stanislaw Gruszka 提交于 8月 08, 2012

On architectures where cputime_t is 64 bit type, is possible to trigger
divide by zero on do_div(temp, (__force u32) total) line, if total is a
non zero number but has lower 32 bit's zeroed. Removing casting is not
a good solution since some do_div() implementations do cast to u32
internally.

This problem can be triggered in practice on very long lived processes:

  PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
   #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
   #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
   #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
   #3 [ffff880472a51cd0] die at ffffffff8100f26b
   #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
   #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
   #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
      [exception RIP: thread_group_times+0x56]
      RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
      RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
      RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
      RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
      R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
   #8 [ffff880472a51f40] sys_times at ffffffff81088524
   #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
      RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
      RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
      RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
      RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
      R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
      R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
      ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b

Cc: stable@vger.kernel.org
Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

bea6832c

sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466

由 Peter Zijlstra 提交于 8月 08, 2012

Peter Portante reported that for large cgroup hierarchies (and or on
large CPU counts) we get immense lock contention on rq->lock and stuff
stops working properly.

His workload was a ton of processes, each in their own cgroup,
everybody idling except for a sporadic wakeup once every so often.

It was found that:

  schedule()
    idle_balance()
      load_balance()
        local_irq_save()
        double_rq_lock()
        update_h_load()
          walk_tg_tree(tg_load_down)
            tg_load_down()

Results in an entire cgroup hierarchy walk under rq->lock for every
new-idle balance and since new-idle balance isn't throttled this
results in a lot of work while holding the rq->lock.

This patch does two things, it removes the work from under rq->lock
based on the good principle of race and pray which is widely employed
in the load-balancer as a whole. And secondly it throttles the
update_h_load() calculation to max once per jiffy.

I considered excluding update_h_load() for new-idle balance
all-together, but purely relying on regular balance passes to update
this data might not work out under some rare circumstances where the
new-idle busiest isn't the regular busiest for a while (unlikely, but
a nightmare to debug if someone hits it and suffers).

Cc: pjt@google.com
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Reported-by: NPeter Portante <pportant@redhat.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

a35b6466

13 8月, 2012 1 次提交

printk: Fix calculation of length used to discard records · e3756477

由 Jeff Mahoney 提交于 8月 10, 2012

While tracking down a weird buffer overflow issue in a program that
looked to be sane, I started double checking the length returned by
syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
the buffer.

Sure enough, it was.  I saw this in strace:

  11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279

It turns out that the loops that calculate how much space the entries
will take when they're copied don't include the newlines and prefixes
that will be included in the final output since prev flags is passed as
zero.

This patch properly accounts for it and fixes the overflow.

CC: stable@kernel.org
Signed-off-by: NJeff Mahoney <jeffm@suse.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e3756477

09 8月, 2012 1 次提交

Revert "NMI watchdog: fix for lockup detector breakage on resume" · 300d3739

由 Rafael J. Wysocki 提交于 8月 07, 2012

Revert commit 45226e94 (NMI watchdog: fix for lockup detector breakage
on resume) which breaks resume from system suspend on my SH7372
Mackerel board (by causing a NULL pointer dereference to happen) and
is generally wrong, because it abuses the CPU hotplug functionality
in a shamelessly blatant way.

The original issue should be addressed through appropriate syscore
resume callback instead.
Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>

300d3739

05 8月, 2012 1 次提交

time: Fix adjustment cleanup bug in timekeeping_adjust() · 1d17d174

由 Ingo Molnar 提交于 8月 04, 2012

Tetsuo Handa reported that sporadically the system clock starts
counting up too quickly which is enough to confuse the hangcheck
timer to print a bogus stall warning.

Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
timekeeping_adjust" overlooked this exit path:

        } else
                return;

which should really be a proper exit sequence, fixing the bug as a
side effect.

Also make the flow more readable by properly balancing curly
braces.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
Signed-off-by: NIngo Molnar <mingo@kernel.org>
Cc: john.stultz@linaro.org
Cc: a.p.zijlstra@chello.nl
Cc: richardcochran@gmail.com
Cc: prarit@redhat.com
Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1d17d174

01 8月, 2012 5 次提交

mm: allow PF_MEMALLOC from softirq context · 907aed48

由 Mel Gorman 提交于 7月 31, 2012

This is needed to allow network softirq packet processing to make use of
PF_MEMALLOC.

Currently softirq context cannot use PF_MEMALLOC due to it not being
associated with a task, and therefore not having task flags to fiddle with
- thus the gfp to alloc flag mapping ignores the task flags when in
interrupts (hard or soft) context.

Allowing softirqs to make use of PF_MEMALLOC therefore requires some
trickery.  This patch borrows the task flags from whatever process happens
to be preempted by the softirq.  It then modifies the gfp to alloc flags
mapping to not exclude task flags in softirq context, and modify the
softirq code to save, clear and restore the PF_MEMALLOC flag.

The save and clear, ensures the preempted task's PF_MEMALLOC flag doesn't
leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
cannot leak back into the preempted process.  This should be safe due to
the following reasons

Softirqs can run on multiple CPUs sure but the same task should not be
	executing the same softirq code. Neither should the softirq
	handler be preempted by any other softirq handler so the flags
	should not leak to an unrelated softirq.

Softirqs re-enable hardware interrupts in __do_softirq() so can be
	preempted by hardware interrupts so PF_MEMALLOC is inherited
	by the hard IRQ. However, this is similar to a process in
	reclaim being preempted by a hardirq. While PF_MEMALLOC is
	set, gfp_to_alloc_flags() distinguishes between hard and
	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
	flag.

If the softirq is deferred to ksoftirq then its flags may be used
        instead of a normal tasks but as the softirq cannot be preempted,
        the PF_MEMALLOC flag does not leak to other code by accident.

[davem@davemloft.net: Document why PF_MEMALLOC is safe]
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NMel Gorman <mgorman@suse.de>
Cc: David Miller <davem@davemloft.net>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

907aed48

mm/hotplug: correctly setup fallback zonelists when creating new pgdat · 9adb62a5

由 Jiang Liu 提交于 7月 31, 2012

When hotadd_new_pgdat() is called to create new pgdat for a new node, a
fallback zonelist should be created for the new node.  There's code to try
to achieve that in hotadd_new_pgdat() as below:

	/*
	 * The node we allocated has no zone fallback lists. For avoiding
	 * to access not-initialized zonelist, build here.
	 */
	mutex_lock(&zonelists_mutex);
	build_all_zonelists(pgdat, NULL);
	mutex_unlock(&zonelists_mutex);

But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
new node is still in offline state because node_set_online(nid) hasn't
been called yet.  And build_all_zonelists() only builds zonelists for
online nodes as:

        for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);

                build_zonelists(pgdat);
                build_zonelist_cache(pgdat);
        }

Though we hope to create zonelist for the new pgdat, but it doesn't.  So
add a new parameter "pgdat" the build_all_zonelists() to build pgdat for
the new pgdat too.
Signed-off-by: NJiang Liu <liuj97@gmail.com>
Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Keping Chen <chenkeping@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9adb62a5

memcg: rename config variables · c255a458

由 Andrew Morton 提交于 7月 31, 2012

Sanity:

CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

[mhocko@suse.cz: fix missed bits]
Cc: Glauber Costa <glommer@parallels.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c255a458

mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads · 3965c9ae

由 Wanpeng Li 提交于 7月 31, 2012

Since per-BDI flusher threads were introduced in 2.6, the pdflush
mechanism is not used any more.  But the old interface exported through
/proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.

For back-compatibility, printk warning information and return 2 to notify
the users that the interface is removed.
Signed-off-by: NWanpeng Li <liwp@linux.vnet.ibm.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3965c9ae

mm: account the total_vm in the vm_stat_account() · 44de9d0c

由 Huang Shijie 提交于 7月 31, 2012

vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
But we can also account for total_vm in the vm_stat_account() which makes
the code tidy.

Even for mprotect_fixup(), we can get the right result in the end.
Signed-off-by: NHuang Shijie <shijie8@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

44de9d0c

31 7月, 2012 12 次提交

time: Remove all direct references to timekeeper · 4e250fdd

由 John Stultz 提交于 7月 27, 2012

Ingo noted that the numerous timekeeper.value references made
the timekeeping code ugly and caused many long lines that
had to be broken up. He recommended replacing timekeeper.value
references with tk->value.

This patch provides a local tk value for all top level time
functions and sets it to &timekeeper. Then all timekeeper
access is done via a tk pointer.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-6-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

4e250fdd

time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates · 6d0ef903

由 John Stultz 提交于 7月 27, 2012

For performance reasons, we maintain ktime_t based duplicates of
wall_to_monotonic (offs_real) and total_sleep_time (offs_boot).

Since large problems could occur (such as the resume regression
on 3.5-rc7, or the leapsecond hrtimer issue) if these value
pairs were to be inconsistently updated, this patch this cleans
up how we modify these value pairs to ensure we are always
consistent.

As a side-effect this is also more efficient as we only
caulculate the duplicate values when they are changed,
rather then every update_wall_time call.

This also provides WARN_ONs to detect if future changes break
the invariants.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-5-git-send-email-john.stultz@linaro.org
[ Cleaned up minor style issues. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

6d0ef903

time: Clean up stray newlines · d4e3ab38

由 John Stultz 提交于 7月 27, 2012

Ingo noted inconsistent newline usage between functions.
This patch cleans those up.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-4-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

d4e3ab38

time/jiffies: Rename ACTHZ to SHIFTED_HZ · 02ab20ae

由 John Stultz 提交于 7月 27, 2012

Ingo noted that ACTHZ is a confusing name, and requested it
be renamed, so this patch renames ACTHZ to SHIFTED_HZ to
better describe it.
Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

02ab20ae

perf/trace: Add ability to set a target task for events · e6dab5ff

由 Andrew Vagin 提交于 7月 11, 2012

A few events are interesting not only for a current task.
For example, sched_stat_* events are interesting for a task
which wakes up. For this reason, it will be good if such
events will be delivered to a target task too.

Now a target task can be set by using __perf_task().

The original idea and a draft patch belongs to Peter Zijlstra.

I need these events for profiling sleep times. sched_switch is used for
getting callchains and sched_stat_* is used for getting time periods.
These events are combined in user space, then it can be analyzed by
perf tools.
Inspired-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Arun Sharma <asharma@fb.com>
Signed-off-by: NAndrew Vagin <avagin@openvz.org>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

e6dab5ff

sched/cleanups: Add load balance cpumask pointer to 'struct lb_env' · b9403130

由 Michael Wang 提交于 7月 12, 2012

With this patch struct ld_env will have a pointer of the load balancing
cpumask and we don't need to pass a cpumask around anymore.
Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b9403130

kernel/debug: Make use of KGDB_REASON_NMI · b10d22d6

由 Anton Vorontsov 提交于 7月 30, 2012

Currently kernel never set KGDB_REASON_NMI. We do now, when we enter
KGDB/KDB from an NMI.

This is not to be confused with kgdb_nmicallback(), NMI callback is
an entry for the slave CPUs during CPUs roundup, but REASON_NMI is the
entry for the master CPU.
Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

b10d22d6

kdb: Remove cpu from the more prompt · 07cd27bb

由 Jason Wessel 提交于 7月 30, 2012

Having the CPU in the more prompt is completely redundent vs the
standard kdb prompt, and it also wastes 32 bytes on the stack.
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

07cd27bb

kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP · 0f26d0e0

由 Jason Wessel 提交于 7月 30, 2012

This code cleanup was missed in the original kdb merge, and this code
is simply not used at all. The code that was previously used to set
the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge.
Signed-off-by: NJason Wessel <jason.wessel@windriver.com>

0f26d0e0

resource: make sure requested range is included in the root range · 65fed8f6

由 Octavian Purdila 提交于 7月 30, 2012

When the requested range is outside of the root range the logic in
__reserve_region_with_split will cause an infinite recursion which will
overflow the stack as seen in the warning bellow.

This particular stack overflow was caused by requesting the
(100000000-107ffffff) range while the root range was (0-ffffffff).  In
this case __request_resource would return the whole root range as
conflict range (i.e.  0-ffffffff).  Then, the logic in
__reserve_region_with_split would continue the recursion requesting the
new range as (conflict->end+1, end) which incidentally in this case
equals the originally requested range.

This patch aborts looking for an usable range when the request does not
intersect with the root range.  When the request partially overlaps with
the root range, it ajust the request to fall in the root range and then
continues with the new request.

When the request is modified or aborted errors and a stack trace are
logged to allow catching the errors in the upper layers.

[    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
[    5.975150] Modules linked in:
[    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
[    5.985324] Call Trace:
[    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
[    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
[    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
[    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
[    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
[    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
[    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
[    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
[    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
[    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
[    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
[    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
[    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
[    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
...
[    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
[    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
[    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
[    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
[    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
[    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
[    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
[    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
[    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
[    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
[    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10
Signed-off-by: NOctavian Purdila <octavian.purdila@intel.com>
Signed-off-by: NRam Pai <linuxram@us.ibm.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

65fed8f6

taskstats: check nla_reserve() return · 25353b33

由 Alan Cox 提交于 7月 30, 2012

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621

Reported-by: <rucsoftsec@gmail.com>
Signed-off-by: NAlan Cox <alan@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

25353b33

sysctl: suppress kmemleak messages · fd4b616b

由 Steven Rostedt 提交于 7月 30, 2012

register_sysctl_table() is a strange function, as it makes internal
allocations (a header) to register a sysctl_table.  This header is a
handle to the table that is created, and can be used to unregister the
table.  But if the table is permanent and never unregistered, the header
acts the same as a static variable.

Unfortunately, this allocation of memory that is never expected to be
freed fools kmemleak in thinking that we have leaked memory.  For those
sysctl tables that are never unregistered, and have no pointer referencing
them, kmemleak will think that these are memory leaks:

unreferenced object 0xffff880079fb9d40 (size 192):
  comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
    [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
    [<ffffffff8110b852>] __kmalloc+0x107/0x153
    [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
    [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
    [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
    [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
    [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
    [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
    [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
    [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
    [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
    [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
    [<ffffffffffffffff>] 0xffffffffffffffff

The sysctl_base_table used by sysctl itself is one such instance that
registers the table to never be unregistered.

Use kmemleak_not_leak() to suppress the kmemleak false positive.
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fd4b616b

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功