Commits · d8a0349c0cea477322c66ea9362f10c62fad5f62 · openanolis / cloud-kernel

22 Jan, 2013 3 commits

tracing: Use this_cpu_ptr per-cpu helper · d8a0349c

Authored by Shan Wei on Nov 13, 2012

typeof(&buffer) is a pointer to an array of 1024 chars, i.e. char (*)[1024].
typeof(&buffer[0]), however, is a pointer to char, which matches the return type of get_trace_buf().
As is well known, the value of &buffer is equal to that of &buffer[0],
so returning this_cpu_ptr(&percpu_buffer->buffer[0]) avoids the type cast.

Link: http://lkml.kernel.org/r/50A1A800.3020102@gmail.com

Reviewed-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

d8a0349c
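
The type distinction that commit d8a0349c relies on can be checked in plain C. This is a userspace sketch, not kernel code: `struct trace_buffer_demo`, `first_char()` and `same_address()` are hypothetical names, and only the 1024-byte size comes from the commit message.

```c
/* Hypothetical stand-in for the per-cpu trace buffer; 1024 is the size
 * quoted in the commit message. */
struct trace_buffer_demo {
    char buffer[1024];
};

/* &b->buffer has type char (*)[1024]; &b->buffer[0] has type char *.
 * Returning the second form matches a char * return type without a cast. */
static char *first_char(struct trace_buffer_demo *b)
{
    return &b->buffer[0];
}

/* Returns 1 when both expressions refer to the same address. */
static int same_address(struct trace_buffer_demo *b)
{
    char (*whole)[1024] = &b->buffer;   /* pointer to the whole array */

    return (void *)whole == (void *)first_char(b);
}
```

The two pointers compare equal, but `sizeof(*whole)` is 1024 while `sizeof(*first_char(b))` is 1, which is why the compiler treats them as different types.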

ring-buffer: Remove unnecessary recusive call in rb_advance_iter() · 771e0384

Authored by Steven Rostedt on Nov 30, 2012

The original ring-buffer code had special checks at the start
of rb_advance_iter() and instead of repeating them again at the
end of the function if a certain condition existed, I just did
a recursive call to rb_advance_iter() because the special condition
would cause rb_advance_iter() to return early (after the checks).

But as things have changed, the special checks no longer exist
and the only thing done for the special_condition is to call
rb_inc_iter() and return. Instead of doing a confusing recursive call,
just call rb_inc_iter() instead.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

771e0384

ftrace: Be first to run code modification on modules · c1bf08ac

Authored by Steven Rostedt on Dec 14, 2012

If some other kernel subsystem has a module notifier, and adds a kprobe
to a ftrace mcount point (now that kprobes work on ftrace points),
when the ftrace notifier runs it will fail and disable ftrace, as well
as kprobes that are attached to ftrace points.

Here's the error:

 WARNING: at kernel/trace/ftrace.c:1618 ftrace_bug+0x239/0x280()
 Hardware name: Bochs
 Modules linked in: fat(+) stap_56d28a51b3fe546293ca0700b10bcb29__8059(F) nfsv4 auth_rpcgss nfs dns_resolver fscache xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack lockd sunrpc ppdev parport_pc parport microcode virtio_net i2c_piix4 drm_kms_helper ttm drm i2c_core [last unloaded: bid_shared]
 Pid: 8068, comm: modprobe Tainted: GF            3.7.0-0.rc8.git0.1.fc19.x86_64 #1
 Call Trace:
  [<ffffffff8105e70f>] warn_slowpath_common+0x7f/0xc0
  [<ffffffff81134106>] ? __probe_kernel_read+0x46/0x70
  [<ffffffffa0180000>] ? 0xffffffffa017ffff
  [<ffffffffa0180000>] ? 0xffffffffa017ffff
  [<ffffffff8105e76a>] warn_slowpath_null+0x1a/0x20
  [<ffffffff810fd189>] ftrace_bug+0x239/0x280
  [<ffffffff810fd626>] ftrace_process_locs+0x376/0x520
  [<ffffffff810fefb7>] ftrace_module_notify+0x47/0x50
  [<ffffffff8163912d>] notifier_call_chain+0x4d/0x70
  [<ffffffff810882f8>] __blocking_notifier_call_chain+0x58/0x80
  [<ffffffff81088336>] blocking_notifier_call_chain+0x16/0x20
  [<ffffffff810c2a23>] sys_init_module+0x73/0x220
  [<ffffffff8163d719>] system_call_fastpath+0x16/0x1b
 ---[ end trace 9ef46351e53bbf80 ]---
 ftrace failed to modify [<ffffffffa0180000>] init_once+0x0/0x20 [fat]
  actual: cc:bb:d2:4b:e1

A kprobe was added to the init_once() function in the fat module on load.
But this happened before ftrace could have touched the code. As ftrace
didn't run yet, the kprobe system had no idea it was a ftrace point and
simply added a breakpoint to the code (0xcc in the cc:bb:d2:4b:e1).

Then when ftrace went to modify the location from a call to mcount/fentry
into a nop, it didn't see a call op, but instead it saw the breakpoint op
and not knowing what to do with it, ftrace shut itself down.

The solution is to simply give the ftrace module notifier the max priority.
This should have been done regardless, as the core code ftrace modification
also happens very early on in boot up. This makes the module modification
closer to core modification.

Link: http://lkml.kernel.org/r/20130107140333.593683061@goodmis.org

Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Reported-by: Frank Ch. Eigler <fche@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

c1bf08ac
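
The effect of giving one notifier the maximum priority, as commit c1bf08ac describes, can be modelled in userspace. The sketch below is a simplified, hypothetical priority-ordered chain (`struct demo_notifier`, `demo_register()` and `demo_call_chain()` are illustrative names, not the kernel notifier API); it only shows that an entry registered with `INT_MAX` runs before entries registered earlier with a lower priority.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of one entry in a priority-ordered notifier chain. */
struct demo_notifier {
    int id;                       /* identifies the subscriber in the log */
    int priority;                 /* higher values run earlier */
    struct demo_notifier *next;
};

/* Insert nb so that higher-priority entries come first; entries with
 * equal priority keep their registration order. */
static void demo_register(struct demo_notifier **head, struct demo_notifier *nb)
{
    while (*head && (*head)->priority >= nb->priority)
        head = &(*head)->next;
    nb->next = *head;
    *head = nb;
}

/* Walk the chain and record the order in which subscribers would run.
 * Returns the number of entries written to order[]. */
static int demo_call_chain(const struct demo_notifier *head, int *order, int max)
{
    int n = 0;

    for (; head && n < max; head = head->next)
        order[n++] = head->id;
    return n;
}
```

Registering a default-priority subscriber first and an `INT_MAX` subscriber second still yields the `INT_MAX` subscriber at the head of the chain, which is the ordering the fix depends on.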

17 Jan, 2013 1 commit

module, async: async_synchronize_full() on module init iff async is used · 774a1221

Authored by Tejun Heo on Jan 15, 2013

If the default iosched is built as module, the kernel may deadlock
while trying to load the iosched module on device probe if the probing
was running off async.  This is because async_synchronize_full() at
the end of module init ends up waiting for the async job which
initiated the module loading.

 async A				modprobe

 1. finds a device
 2. registers the block device
 3. request_module(default iosched)
					4. modprobe in userland
					5. load and init module
					6. async_synchronize_full()

Async A waits for modprobe to finish in request_module() and modprobe
waits for async A to finish in async_synchronize_full().

Because there's no easy to track dependency once control goes out to
userland, implementing properly nested flushing is difficult.  For
now, make module init perform async_synchronize_full() iff module init
has queued async jobs as suggested by Linus.

This avoids the described deadlock because iosched module doesn't use
async and thus wouldn't invoke async_synchronize_full().  This is
hacky and incomplete.  It will deadlock if async module loading nests;
however, this works around the known problem case and seems to be the
best of bad options.

For more details, please refer to the following thread.

  http://thread.gmane.org/gmane.linux.kernel/1420814

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Alex Riesen <raa.lkml@gmail.com>
Tested-by: Ming Lei <ming.lei@canonical.com>
Tested-by: Alex Riesen <raa.lkml@gmail.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

774a1221

15 Jan, 2013 1 commit

tracing: Fix regression of trace_pipe · 250bfd3d

Authored by Liu Bo on Jan 14, 2013

Commit 0fb9656d "tracing: Make tracing_enabled be equal to tracing_on"
changes the behaviour of trace_pipe, i.e. it makes trace_pipe return if
we've read something and tracing is enabled, and this means that we have
to 'cat trace_pipe' again and again while running tests.

IMO the right way is if tracing is enabled, we always block and wait for
ring buffer, or we may lose what we want since ring buffer's size is limited.

Link: http://lkml.kernel.org/r/1358132051-5410-1-git-send-email-bo.li.liu@oracle.com

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

250bfd3d

12 Jan, 2013 5 commits

kernel/audit.c: avoid negative sleep durations · 82919919

Authored by Andrew Morton on Jan 11, 2013

audit_log_start() performs the same jiffies comparison in two places.
If sufficient time has elapsed between the two comparisons, the second
one produces a negative sleep duration:

  schedule_timeout: wrong timeout value fffffffffffffff0
  Pid: 6606, comm: trinity-child1 Not tainted 3.8.0-rc1+ #43
  Call Trace:
    schedule_timeout+0x305/0x340
    audit_log_start+0x311/0x470
    audit_log_exit+0x4b/0xfb0
    __audit_syscall_exit+0x25f/0x2c0
    sysret_audit+0x17/0x21

Fix it by performing the comparison a single time.

Reported-by: Dave Jones <davej@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

82919919
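
The idea behind commit 82919919, computing the remaining time once and then testing the saved value, can be sketched as follows. `remaining_ticks()` is a hypothetical helper, not the kernel's code; it assumes a free-running unsigned tick counter similar to jiffies and the usual two's-complement conversion to a signed value.

```c
/* Remaining sleep time in ticks, computed a single time.
 * Returns 0 when the deadline has already passed, so the caller can
 * never hand a negative duration to its sleep primitive. */
static long remaining_ticks(unsigned long deadline, unsigned long now)
{
    /* Signed difference of unsigned counters; stays meaningful across
     * counter wraparound on two's-complement machines. */
    long left = (long)(deadline - now);

    return left > 0 ? left : 0;
}
```

With a deadline of 100 and a current tick of 116, the raw difference is -16, which on a 64-bit machine is the fffffffffffffff0 value shown in the warning above; the helper returns 0 instead.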

audit: catch possible NULL audit buffers · 0644ec0c

Authored by Kees Cook on Jan 11, 2013

It's possible for audit_log_start() to return NULL.  Handle it in the
various callers.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Julien Tinnes <jln@google.com>
Cc: Will Drewry <wad@google.com>
Cc: Steve Grubb <sgrubb@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0644ec0c

audit: create explicit AUDIT_SECCOMP event type · 7b9205bd

Authored by Kees Cook on Jan 11, 2013

The seccomp path was using AUDIT_ANOM_ABEND from when seccomp mode 1
could only kill a process.  While we still want to make sure an audit
record is forced on a kill, this should use a separate record type since
seccomp mode 2 introduces other behaviors.

In the case of "handled" behaviors (process wasn't killed), only emit a
record if the process is under inspection.  This change also fixes
userspace examination of seccomp audit events, since it was considered
malformed due to missing fields of the AUDIT_ANOM_ABEND event type.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Julien Tinnes <jln@google.com>
Acked-by: Will Drewry <wad@chromium.org>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7b9205bd

lockdep, rwsem: provide down_write_nest_lock() · 1b963c81

Authored by Jiri Kosina on Jan 11, 2013

down_write_nest_lock() provides a means to annotate a locking scenario
where an outer lock is guaranteed to serialize the order in which nested
locks are acquired.

This is analogous to the already existing mutex_lock_nest_lock() and
spin_lock_nest_lock().

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1b963c81

tracing: Fix regression with irqsoff tracer and tracing_on file · 2df8f8a6

Authored by Steven Rostedt on Jan 11, 2013

Commit 02404baf "tracing: Remove deprecated tracing_enabled file"
removed the tracing_enabled file as it never worked properly and
the tracing_on file should be used instead. But the tracing_on file
didn't call into the tracer's start/stop routines like the
tracing_enabled file did. This caused trace-cmd to break when it
enabled the irqsoff tracer.

If you just did "echo irqsoff > current_tracer" then it would work
properly. But the tool trace-cmd disables tracing first by writing
"0" into the tracing_on file. Then it writes "irqsoff" into
current_tracer and then writes "1" into tracing_on. Unfortunately,
the above commit changed the irqsoff tracer to check the tracing_on
status instead of the tracing_enabled status. If it's disabled then
it does not start the tracer internals.

The problem is that writing "1" into tracing_on does not call the
tracer's "start" routine like writing "1" into tracing_enabled did.
This makes the irqsoff tracer not start when using the trace-cmd
tool, and is a regression for userspace.

The simple fix is to have the tracing_on file call the tracer's start()
method when being enabled (and the stop() method when disabled).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

2df8f8a6
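
The shape of the fix in commit 2df8f8a6 can be illustrated with a small callback table. This is only a sketch under assumptions: `struct demo_tracer`, `demo_buffer_on` and `demo_write_tracing_on()` are hypothetical names that stand in for the real tracer structure and the tracing_on write handler, which the commit message does not show.

```c
#include <stddef.h>

/* Hypothetical tracer with optional start/stop hooks. */
struct demo_tracer {
    void (*start)(struct demo_tracer *t);
    void (*stop)(struct demo_tracer *t);
    int running;                  /* set by the hooks below */
};

static void demo_start(struct demo_tracer *t) { t->running = 1; }
static void demo_stop(struct demo_tracer *t)  { t->running = 0; }

/* Stand-in for the ring buffer's recording state. */
static int demo_buffer_on;

/* Model of a write to tracing_on: besides switching recording on or
 * off, also invoke the current tracer's start() or stop() hook. */
static void demo_write_tracing_on(struct demo_tracer *t, int val)
{
    demo_buffer_on = val ? 1 : 0;
    if (val) {
        if (t->start != NULL)
            t->start(t);
    } else {
        if (t->stop != NULL)
            t->stop(t);
    }
}
```

Writing 0 and then 1, the sequence the commit message attributes to trace-cmd, leaves the tracer running because the second write reaches its start() hook.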

11 Jan, 2013 1 commit

audit: fix auditfilter.c kernel-doc warnings · bfbbd96c

Authored by Randy Dunlap on Jan 09, 2013

Fix new kernel-doc warning in auditfilter.c:

  Warning(kernel/auditfilter.c:1157): Excess function parameter 'uid' description in 'audit_receive_filter'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@redhat.com>
Cc: linux-audit@redhat.com (subscribers-only)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bfbbd96c

10 Jan, 2013 1 commit

tracing: Fix regression of trace_options file setting · a8dd2176

Authored by Steven Rostedt on Jan 09, 2013

The latest change to allow trace options to be set on the command
line also broke the trace_options file.

The zeroing of the last byte of the option name that is echoed into
the trace_options file was removed with the consolidation of some
of the code. The comparison between the option and what was written to
the trace_options file fails because the string holding the data
written doesn't terminate with a null character.

A zero needs to be added to the end of the string copied from
user space.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

a8dd2176
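
Why the missing terminator in commit a8dd2176 breaks the comparison can be shown with a bounded copy in userspace. `copy_option()` is a hypothetical helper, and plain `memcpy()` stands in for the copy from user space; the point is only that the destination must end with a zero byte before it is handed to a string comparison.

```c
#include <stddef.h>
#include <string.h>

/* Copy at most cap - 1 bytes of an unterminated input into dst and
 * append the terminating zero. Returns the number of bytes copied. */
static size_t copy_option(char *dst, size_t cap, const char *src, size_t len)
{
    if (cap == 0)
        return 0;
    if (len >= cap)
        len = cap - 1;            /* leave room for the terminator */
    memcpy(dst, src, len);
    dst[len] = '\0';              /* the step that had been lost */
    return len;
}
```

Without the final assignment, a `strcmp()` against an option name would keep reading whatever bytes happened to follow the copied data.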

06 Jan, 2013 2 commits

signals: set_current_blocked() can use __set_current_blocked() · 0c4a8423

Authored by Oleg Nesterov on Jan 05, 2013

Cleanup.  And I think we need more cleanups, in particular
__set_current_blocked() and sigprocmask() should die.  Nobody should
ever block SIGKILL or SIGSTOP.

 - Change set_current_blocked() to use __set_current_blocked()

 - Change sys_sigprocmask() to use set_current_blocked(), this way it
   should not worry about SIGKILL/SIGSTOP.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0c4a8423

signals: sys_ssetmask() uses uninitialized newmask · 5ba53ff6

Authored by Oleg Nesterov on Jan 05, 2013

Commit 77097ae5 ("most of set_current_blocked() callers want
SIGKILL/SIGSTOP removed from set") removed the initialization of newmask
by accident, causing ltp to complain like this:

  ssetmask01    1  TFAIL  :  sgetmask() failed: TEST_ERRNO=???(0): Success

Restore the proper initialization.

Reported-and-tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: stable@kernel.org	# v3.5+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5ba53ff6

05 Jan, 2013 1 commit

printk: fix incorrect length from print_time() when seconds > 99999 · 35dac27c

Authored by Roland Dreier on Jan 04, 2013

print_prefix() passes a NULL buf to print_time() to get the length of
the time prefix; when printk times are enabled, the current code just
returns the constant 15, which matches the format "[%5lu.%06lu] " used
to print the time value.  However, this is obviously incorrect when the
whole seconds part of the time gets beyond 5 digits (100000 seconds is a
bit more than a day of uptime).

The simple fix is to use snprintf(NULL, 0, ...) to calculate the actual
length of the time prefix.  This could be micro-optimized but it seems
better to have simpler, more readable code here.

The bug leads to the syslog system call miscomputing which messages fit
into the userspace buffer.  If there are enough messages to fill
log_buf_len and some have a timestamp >= 100000, dmesg may fail with:

    # dmesg
    klogctl: Bad address

When this happens, strace shows that the failure is indeed EFAULT due to
the kernel mistakenly accessing past the end of dmesg's buffer, since
dmesg asks the kernel how big a buffer it needs, allocates a bit more,
and then gets an error when it asks the kernel to fill it:

    syslog(0xa, 0, 0)                       = 1048576
    mmap(NULL, 1052672, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa4d25d2000
    syslog(0x3, 0x7fa4d25d2010, 0x100008)   = -1 EFAULT (Bad address)

As far as I can see, the bug has been there as long as print_time(),
which comes from commit 084681d1 ("printk: flush continuation lines
immediately to console") in 3.5-rc5.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joe Perches <joe@perches.com>
Cc: Sylvain Munaut <s.munaut@whatever-company.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

35dac27c
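
The snprintf(NULL, 0, ...) technique named in commit 35dac27c is standard C and easy to try in userspace. In the sketch below, `time_prefix_len()` is a hypothetical helper; the format string is the one quoted in the commit message.

```c
#include <stdio.h>

/* Length of the "[%5lu.%06lu] " time prefix for a given timestamp.
 * With a NULL buffer and size 0, snprintf() writes nothing and returns
 * the number of characters the formatted output would need. */
static int time_prefix_len(unsigned long secs, unsigned long usecs)
{
    return snprintf(NULL, 0, "[%5lu.%06lu] ", secs, usecs);
}
```

For 99999 seconds the result is 15, the constant the old code returned; for 100000 seconds it is 16, because `%5lu` is a minimum field width rather than a maximum.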

26 Dec, 2012 1 commit

pidns: Stop pid allocation when init dies · c876ad76

Authored by Eric W. Biederman on Dec 21, 2012

Oleg pointed out that in a pid namespace the sequence:
- pid 1 becomes a zombie
- setns(thepidns), fork,...
- reaping pid 1.
- The injected processes exiting.

can lead to processes attempting to access their child reaper and
instead following a stale pointer.

That waitpid for init can return before all of the processes in
the pid namespace have exited is also unfortunate.

Avoid these problems by disabling the allocation of new pids in a pid
namespace when init dies, instead of when the last process in a pid
namespace is reaped.

Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

c876ad76

25 Dec, 2012 1 commit

pidns: Outlaw thread creation after unshare(CLONE_NEWPID) · 8382fcac

Authored by Eric W. Biederman on Dec 20, 2012

The sequence:
unshare(CLONE_NEWPID)
clone(CLONE_THREAD|CLONE_SIGHAND|CLONE_VM)

creates a new process in the new pid namespace without setting
pid_ns->child_reaper.  After forking this results in a NULL
pointer dereference.

Avoid this and other nonsense scenarios that can show up after
creating a new pid namespace with unshare by adding a new
check in copy_process.

Pointed-out-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

8382fcac

21 Dec, 2012 2 commits

keys: use keyring_alloc() to create module signing keyring · cfde8190

Authored by David Howells on Dec 20, 2012

Use keyring_alloc() to create special keyrings now that it has
a permissions parameter rather than using key_alloc() +
key_instantiate_and_link().

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

cfde8190

kcmp: include linux/ptrace.h · 44fd07e9

Authored by Cyrill Gorcunov on Dec 20, 2012

This makes it compile on s390. After all, ptrace_may_access()
(which we use in this file) is declared in linux/ptrace.h.

This is preparatory work to wire this syscall up on all archs.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Alexander Kartashov <alekskartashov@parallels.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

44fd07e9

20 Dec, 2012 7 commits

sched: numa: ksm: fix oops in task_numa_placment() · 2832bc19

Authored by Hugh Dickins on Dec 19, 2012

task_numa_placement() oopsed on NULL p->mm when task_numa_fault() got
called in the handling of break_ksm() for ksmd.  That might be a
peculiar case, which perhaps KSM could take steps to avoid, but it's
more robust if task_numa_placement() allows for such a possibility.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2832bc19

new helpers: __save_altstack/__compat_save_altstack, switch x86 and um to those · c40702c4

Authored by Al Viro on Nov 20, 2012

Note that they rely on access_ok() having already been checked by the caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c40702c4

generic compat_sys_sigaltstack() · 90268439

Authored by Al Viro on Dec 14, 2012

Again, conditional on CONFIG_GENERIC_SIGALTSTACK

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

90268439

introduce generic sys_sigaltstack(), switch x86 and um to it · 6bf9adfc

Authored by Al Viro on Dec 14, 2012

Conditional on CONFIG_GENERIC_SIGALTSTACK; architectures that do not
select it are completely unaffected

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

6bf9adfc

new helper: restore_altstack() · 5c49574f

Authored by Al Viro on Nov 18, 2012

to be used by rt_sigreturn instances

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5c49574f

Bury the conditionals from kernel_thread/kernel_execve series · ae903caa

Authored by Al Viro on Dec 14, 2012

All architectures have
	CONFIG_GENERIC_KERNEL_THREAD
	CONFIG_GENERIC_KERNEL_EXECVE
	__ARCH_WANT_SYS_EXECVE
None of them have __ARCH_WANT_KERNEL_EXECVE and there are only two callers
of kernel_execve() (which is a trivial wrapper for do_execve() now) left.
Kill the conditionals and make both callers use do_execve().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ae903caa

watchdog: Fix disable/enable regression · 3935e895

Authored by Bjørn Mork on Dec 19, 2012

Commit 8d451690 ("watchdog: Fix CPU hotplug regression") causes an
oops or hard lockup when doing

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

and the kernel is booted with nmi_watchdog=1 (default)

Running laptop-mode-tools and disconnecting/connecting AC power will
cause this to trigger, making it a common failure scenario on laptops.

Instead of bailing out of watchdog_disable() when !watchdog_enabled we
can initialize the hrtimer regardless of watchdog_enabled status.  This
makes it safe to call watchdog_disable() in the nmi_watchdog=0 case,
without the negative effect on the enabled => disabled => enabled case.

All these tests pass with this patch:
- nmi_watchdog=1
  echo 0 > /proc/sys/kernel/nmi_watchdog
  echo 1 > /proc/sys/kernel/nmi_watchdog

- nmi_watchdog=0
  echo 0 > /sys/devices/system/cpu/cpu1/online

- nmi_watchdog=0
  echo mem > /sys/power/state

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

3935e895

19 Dec, 2012 3 commits

fork: protect architectures where THREAD_SIZE >= PAGE_SIZE against fork bombs · 2ad306b1

Authored by Glauber Costa on Dec 18, 2012

Because those architectures will draw their stacks directly from the page
allocator, rather than the slab cache, we can directly pass __GFP_KMEMCG
flag, and issue the corresponding free_pages.

This code path is taken when the architecture doesn't define
CONFIG_ARCH_THREAD_INFO_ALLOCATOR (only ia64 seems to), and has
THREAD_SIZE >= PAGE_SIZE.  Luckily, most - if not all - of the remaining
architectures fall in this category.

This will guarantee that every stack page is accounted to the memcg the
process currently lives on, and will make the allocations fail if they
go over the limit.

For the time being, I am defining a new variant of THREADINFO_GFP, not to
mess with the other path.  Once the slab is also tracked by memcg, we can
get rid of that flag.

Tested to successfully protect against :(){ :|:& };:

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Frederic Weisbecker <fweisbec@redhat.com>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2ad306b1

res_counter: return amount of charges after res_counter_uncharge() · 50bdd430

Authored by Glauber Costa on Dec 18, 2012

It is useful to know how many charges are still left after a call to
res_counter_uncharge.  While it is possible to issue a res_counter_read
after uncharge, this can be racy.

If we need, for instance, to take some action when the counters drop down
to 0, only one of the callers should see it.  This is the same semantics
as the atomic variables in the kernel.

Since the current return value is void, we don't need to worry about
anything breaking due to this change: nobody relied on that, and only
users appearing from now on will be checking this value.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: JoonSoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

50bdd430
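
The race that commit 50bdd430 describes, and the "same semantics as the atomic variables" it aims for, can be sketched with C11 atomics. This swaps in `stdatomic.h` purely for illustration and is not how res_counter itself is implemented; `struct demo_counter` and `demo_uncharge()` are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical usage counter. */
struct demo_counter {
    atomic_ulong usage;
};

/* Subtract val and return the amount still charged, as one atomic step.
 * Because the new value comes from the same operation that changed it,
 * only one caller can ever observe the drop to 0. A separate read after
 * the uncharge could let several callers, or none, see 0. */
static unsigned long demo_uncharge(struct demo_counter *c, unsigned long val)
{
    /* atomic_fetch_sub() returns the value held before the subtraction. */
    return atomic_fetch_sub(&c->usage, val) - val;
}
```

A caller that needs to act when the counter empties can then test the return value directly, for example `if (demo_uncharge(&c, n) == 0) { ... }`.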

irq: tsk->comm is an array · 19af395d

Authored by Alan Cox on Dec 18, 2012

The array check is useless so remove it.

[akpm@linux-foundation.org: remove comment, per David]

Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

19af395d

18 Dec, 2012 9 commits

pidns: remove unused is_container_init() · a5ba911e

Authored by Gao feng on Dec 17, 2012

Since commit 1cdcbec1 ("CRED: Neuter sys_capset()")
is_container_init() has no callers.

Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a5ba911e

ptrace: introduce PTRACE_O_EXITKILL · 992fb6e1

Authored by Oleg Nesterov on Dec 17, 2012

Ptrace jailers want to be sure that the tracee can never escape
from the control. However if the tracer dies unexpectedly the
tracee continues to run in potentially unsafe mode.

Add the new ptrace option PTRACE_O_EXITKILL. If the tracer exits
it sends SIGKILL to every tracee which has this bit set.

Note that the new option is not equal to the last-option << 1.  Because
currently all options have an event, and the new one starts the eventless
group.  It uses the random 20 bit, so we have the room for 12 more events,
but we can also add the new eventless options below this one.

Suggested by Amnon Shiloh.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Amnon Shiloh <u3557@miso.sublimeip.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Chris Evans <scarybeasts@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

992fb6e1

compat: generic compat_sys_sched_rr_get_interval() implementation · 0ad50c38

Authored by Catalin Marinas on Dec 17, 2012

This function is used by sparc, powerpc, tile and arm64 for compat support.
The patch adds a generic implementation with a wrapper for PowerPC to do
the u32->int sign extension.

The reason for a single patch covering powerpc, tile, sparc and arm64 is
to keep it bisectable, otherwise kernel building may fail with mismatched
function declarations.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>  [for tile]
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ad50c38

trace: use kbasename() · b2e902f0

Authored by Andy Shevchenko on Dec 17, 2012

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b2e902f0

printk: boot_delay should only affect output · 2fa72c8f

Authored by Andrew Cooks on Dec 17, 2012

The boot_delay parameter affects all printk() calls, even if the log level
prevents visible output from the call.  This results in pointless delays
that are longer than the user intended.

This patch changes the behaviour of boot_delay to only delay output.

Signed-off-by: Andrew Cooks <acooks@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2fa72c8f
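
The behaviour change in commit 2fa72c8f amounts to checking whether a message would be shown before delaying for it. The sketch below is an assumption-laden model, not the patch itself: `demo_boot_delay_ms()` and its parameters are hypothetical, and it assumes the usual printk convention that a message is emitted to the console only when its level is numerically lower than the console log level.

```c
/* Delay, in milliseconds, to apply before emitting one message.
 * Messages the console would suppress get no delay at all. */
static unsigned int demo_boot_delay_ms(int msg_level, int console_loglevel,
                                       unsigned int boot_delay_ms)
{
    if (msg_level >= console_loglevel)
        return 0;                 /* nothing visible, so do not wait */
    return boot_delay_ms;
}
```

With a console log level of 4, a level-7 debug message incurs no delay while a level-3 error still waits for the configured interval.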

watchdog: store the watchdog sample period as a variable · 0f34c400

Authored by Chuansheng Liu on Dec 17, 2012

Currently, getting the sample period always goes through a complex
calculation: get_softlockup_thresh() * ((u64)NSEC_PER_SEC / 5).

We can store the sample period in a variable and mark it as
__read_mostly.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0f34c400
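
The caching described in commit 0f34c400 can be sketched in a few lines. The formula is the one quoted in the commit message; everything else (`DEMO_NSEC_PER_SEC`, `demo_sample_period`, `demo_set_sample_period()`, and passing the threshold in as a parameter) is a hypothetical userspace stand-in.

```c
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000ULL

/* Cached sample period in nanoseconds; recomputed only when the
 * threshold changes instead of on every use. */
static uint64_t demo_sample_period;

/* Store threshold * (NSEC_PER_SEC / 5), i.e. one fifth of the
 * threshold expressed in nanoseconds. */
static void demo_set_sample_period(unsigned int softlockup_thresh_secs)
{
    demo_sample_period = softlockup_thresh_secs * (DEMO_NSEC_PER_SEC / 5);
}
```

For a 20-second threshold the stored value is 4,000,000,000 ns, so the sampling code samples five times per threshold interval without repeating the multiplication.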

lseek: the "whence" argument is called "whence" · 965c8e59

Authored by Andrew Morton on Dec 17, 2012

But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

965c8e59

kernel: remove reference to feature-removal-schedule.txt · 8ec7d50f

Authored by Tao Ma on Dec 17, 2012

In commit 9c0ece06 ("Get rid of Documentation/feature-removal.txt"),
Linus removed feature-removal-schedule.txt from Documentation, but there
are still some references to this file.  So remove them.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8ec7d50f

sched: numa: Fix build error if CONFIG_NUMA_BALANCING && !CONFIG_TRANSPARENT_HUGEPAGE · 221392c3

Authored by Mel Gorman on Dec 17, 2012

Michal Hocko reported that the following build error occurs if
CONFIG_NUMA_BALANCING is set without THP support

  kernel/sched/fair.c: In function ‘task_numa_work’:
  kernel/sched/fair.c:932:55: error: call to ‘__build_bug_failed’ declared with attribute error: BUILD_BUG failed

The problem is that HPAGE_PMD_SHIFT triggers a BUILD_BUG() on
!CONFIG_TRANSPARENT_HUGEPAGE. This patch addresses the problem.

Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

221392c3

17 Dec, 2012 1 commit

random: Mix cputime from each thread that exits to the pool · 61337054

Authored by Nick Kossifidis on Dec 16, 2012

When a thread exits, mix its cputime (userspace + kernelspace) into the entropy pool.

We don't know how "random" this is, so we use add_device_randomness(), which doesn't
touch the entropy count.

Signed-off-by: Nick Kossifidis <mickflemm@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

61337054

15 Dec, 2012 1 commit

userns: Fix typo in description of the limitation of userns_install · 5155040e

Authored by Eric W. Biederman on Dec 09, 2012

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

5155040e

openanolis / cloud-kernel · synced successfully about 1 year ago