提交 · 8bc037fb89bb3104b9ae290d18c877624cd7d9cc · OpenHarmony / kernel_linux

17 3月, 2010 1 次提交

sched: Use proper type in sched_getaffinity() · 8bc037fb

由 KOSAKI Motohiro 提交于 3月 17, 2010

Using the proper type fixes the following compiler warning:

  kernel/sched.c:4850: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: torvalds@linux-foundation.org
Cc: travis@sgi.com
Cc: peterz@infradead.org
Cc: drepper@redhat.com
Cc: rja@sgi.com
Cc: sharyath@in.ibm.com
Cc: steiner@sgi.com
LKML-Reference: <20100317090046.4C79.A69D9226@jp.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8bc037fb

16 3月, 2010 1 次提交

kernel/sched.c: Suppress unused var warning · c890692b

由 Andrew Morton 提交于 3月 11, 2010

On UP:

  kernel/sched.c: In function 'wake_up_new_task':
  kernel/sched.c:2631: warning: unused variable 'cpu'
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

c890692b

15 3月, 2010 1 次提交

sched: sched_getaffinity(): Allow less than NR_CPUS length · cd3d8031

由 KOSAKI Motohiro 提交于 3月 12, 2010

[ Note, this commit changes the syscall ABI for > 1024 CPUs systems. ]

Recently, some distro decided to use NR_CPUS=4096 for mysterious reasons.
Unfortunately, glibc sched interface has the following definition:

	# define __CPU_SETSIZE  1024
	# define __NCPUBITS     (8 * sizeof (__cpu_mask))
	typedef unsigned long int __cpu_mask;
	typedef struct
	{
	  __cpu_mask __bits[__CPU_SETSIZE / __NCPUBITS];
	} cpu_set_t;

It mean, if NR_CPUS is bigger than 1024, cpu_set_t makes an
ABI issue ...

More recently, Sharyathi Nagesh reported following test program makes
misterious syscall failure:

 -----------------------------------------------------------------------
 #define _GNU_SOURCE
 #include<stdio.h>
 #include<errno.h>
 #include<sched.h>

 int main()
 {
     cpu_set_t set;
     if (sched_getaffinity(0, sizeof(cpu_set_t), &set) < 0)
         printf("\n Call is failing with:%d", errno);
 }
 -----------------------------------------------------------------------

Because the kernel assumes len argument of sched_getaffinity() is bigger
than NR_CPUS. But now it is not correct.

Now we are faced with the following annoying dilemma, due to
the limitations of the glibc interface built in years ago:

 (1) if we change glibc's __CPU_SETSIZE definition, we lost
     binary compatibility of _all_ application.

 (2) if we don't change it, we also lost binary compatibility of
     Sharyathi's use case.

Then, I would propse to change the rule of the len argument of
sched_getaffinity().

Old:
	len should be bigger than NR_CPUS
New:
	len should be bigger than maximum possible cpu id

This creates the following behavior:

 (A) In the real 4096 cpus machine, the above test program still
     return -EINVAL.

 (B) NR_CPUS=4096 but the machine have less than 1024 cpus (almost
     all machines in the world), the above can run successfully.

Fortunatelly, BIG SGI machine is mainly used for HPC use case. It means
they can rebuild their programs.

IOW we hope they are not annoyed by this issue ...
Reported-by: NSharyathi Nagesh <sharyath@in.ibm.com>
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: NUlrich Drepper <drepper@redhat.com>
Acked-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <20100312161316.9520.A69D9226@jp.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cd3d8031

13 3月, 2010 31 次提交

tracing: Do not record user stack trace from NMI context · b6345879

由 Steven Rostedt 提交于 3月 12, 2010

A bug was found with Li Zefan's ftrace_stress_test that caused applications
to segfault during the test.

Placing a tracing_off() in the segfault code, and examining several
traces, I found that the following was always the case. The lock tracer
was enabled (lockdep being required) and userstack was enabled. Testing
this out, I just enabled the two, but that was not good enough. I needed
to run something else that could trigger it. Running a load like hackbench
did not work, but executing a new program would. The following would
trigger the segfault within seconds:

  # echo 1 > /debug/tracing/options/userstacktrace
  # echo 1 > /debug/tracing/events/lock/enable
  # while :; do ls > /dev/null ; done

Enabling the function graph tracer and looking at what was happening
I finally noticed that all cashes happened just after an NMI.

 1)               |    copy_user_handle_tail() {
 1)               |      bad_area_nosemaphore() {
 1)               |        __bad_area_nosemaphore() {
 1)               |          no_context() {
 1)               |            fixup_exception() {
 1)   0.319 us    |              search_exception_tables();
 1)   0.873 us    |            }
[...]
 1)   0.314 us    |  __rcu_read_unlock();
 1)   0.325 us    |    native_apic_mem_write();
 1)   0.943 us    |  }
 1)   0.304 us    |  rcu_nmi_exit();
[...]
 1)   0.479 us    |  find_vma();
 1)               |  bad_area() {
 1)               |    __bad_area() {

After capturing several traces of failures, all of them happened
after an NMI. Curious about this, I added a trace_printk() to the NMI
handler to read the regs->ip to see where the NMI happened. In which I
found out it was here:

ffffffff8135b660 <page_fault>:
ffffffff8135b660:       48 83 ec 78             sub    $0x78,%rsp
ffffffff8135b664:       e8 97 01 00 00          callq  ffffffff8135b800 <error_entry>

What was happening is that the NMI would happen at the place that a page
fault occurred. It would call rcu_read_lock() which was traced by
the lock events, and the user_stack_trace would run. This would trigger
a page fault inside the NMI. I do not see where the CR2 register is
saved or restored in NMI handling. This means that it would corrupt
the page fault handling that the NMI interrupted.

The reason the while loop of ls helped trigger the bug, was that
each execution of ls would cause lots of pages to be faulted in, and
increase the chances of the race happening.

The simple solution is to not allow user stack traces in NMI context.
After this patch, I ran the above "ls" test for a couple of hours
without any issues. Without this patch, the bug would trigger in less
than a minute.

Cc: stable@kernel.org
Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

b6345879

tracing: Disable buffer switching when starting or stopping trace · a2f80714

由 Steven Rostedt 提交于 3月 12, 2010

When the trace iterator is read, tracing_start() and tracing_stop()
is called to stop tracing while the iterator is processing the trace
output.

These functions disable both the standard buffer and the max latency
buffer. But if the wakeup tracer is running, it can switch these
buffers between the two disables:

  buffer = global_trace.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

      <<<--------- swap happens here

  buffer = max_tr.buffer;
  if (buffer)
      ring_buffer_record_disable(buffer);

What happens is that we disabled the same buffer twice. On tracing_start()
we can enable the same buffer twice. All ring_buffer_record_disable()
must be matched with a ring_buffer_record_enable() or the buffer
can be disable permanently, or enable prematurely, and cause a bug
where a reset happens while a trace is commiting.

This patch protects these two by taking the ftrace_max_lock to prevent
a switch from occurring.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

a2f80714

tracing: Use same local variable when resetting the ring buffer · 283740c6

由 Steven Rostedt 提交于 3月 12, 2010

In the ftrace code that resets the ring buffer it references the
buffer with a local variable, but then uses the tr->buffer as the
parameter to reset. If the wakeup tracer is running, which can
switch the tr->buffer with the max saved buffer, this can break
the requirement of disabling the buffer before the reset.

   buffer = tr->buffer;
   ring_buffer_record_disable(buffer);
   synchronize_sched();
   __tracing_reset(tr->buffer, cpu);

If the tr->buffer is swapped, then the reset is not happening to the
buffer that was disabled. This will cause the ring buffer to fail.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Reported-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

283740c6

function-graph: Init curr_ret_stack with ret_stack · ea14eb71

由 Steven Rostedt 提交于 3月 12, 2010

If the graph tracer is active, and a task is forked but the allocating of
the processes graph stack fails, it can cause crash later on.

This is due to the temporary stack being NULL, but the curr_ret_stack
variable is copied from the parent. If it is not -1, then in
ftrace_graph_probe_sched_switch() the following:

	for (index = next->curr_ret_stack; index >= 0; index--)
		next->ret_stack[index].calltime += timestamp;

Will cause a kernel OOPS.

Found with Li Zefan's ftrace_stress_test.

Cc: stable@kernel.org
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

ea14eb71

ring-buffer: Move disabled check into preempt disable section · 52fbe9cd

由 Lai Jiangshan 提交于 3月 08, 2010

The ring buffer resizing and resetting relies on a schedule RCU
action. The buffers are disabled, a synchronize_sched() is called
and then the resize or reset takes place.

But this only works if the disabling of the buffers are within the
preempt disabled section, otherwise a window exists that the buffers
can be written to while a reset or resize takes place.

Cc: stable@kernel.org
Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4B949E43.2010906@cn.fujitsu.com>
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>

52fbe9cd

sysctl extern cleanup: lockdep · 2edf5e49

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move lockdep extern declarations to linux/lockdep.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2edf5e49

sysctl extern cleanup: rtmutex · 4f0e056f

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move max_lock_depth extern declaration to linux/rtmutex.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4f0e056f

sysctl extern cleanup: acct · c55b7c3e

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move acct_parm extern declaration to linux/acct.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c55b7c3e

sysctl extern cleanup: sg · 15485a46

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move sg_big_buff extern declaration to scsi/sg.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Acked-by: NDoug Gilbert <dgilbert@interlog.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

15485a46

sysctl extern cleanup: module · 5ed10910

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move modprobe_path extern declaration to linux/kmod.h
Move modules_disabled extern declaration to linux/module.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5ed10910

sysctl extern cleanup: rcu · e5ab6772

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move rcutorture_runnable extern declaration to linux/rcupdate.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Acked-by: NJosh Triplett <josh@joshtriplett.org>
Reviewed-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e5ab6772

sysctl extern cleanup: signal · d33ed52d

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move print_fatal_signals extern declaration to linux/signal.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d33ed52d

sysctl extern cleanup: C_A_D · eb5572fe

由 Dave Young 提交于 3月 10, 2010

Extern declarations in sysctl.c should be moved to their own header file,
and then include them in relavant .c files.

Move C_A_D extern variable declaration to linux/reboot.h
Signed-off-by: NDave Young <hidave.darkstar@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eb5572fe

nsproxy: remove INIT_NSPROXY() · 8467005d

由 Alexey Dobriyan 提交于 3月 10, 2010

Remove INIT_NSPROXY(), use C99 initializer.
Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it.

Note: headers trim will be done later, now it's quite pointless because
results will be invalidated by merge window.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8467005d

pid_ns: zap_pid_ns_processes: use SEND_SIG_NOINFO instead of force_sig() · 13aa9a6b

由 Oleg Nesterov 提交于 3月 10, 2010

zap_pid_ns_processes() uses force_sig(SIGKILL) to ensure SIGKILL will be
delivered to sub-namespace inits as well.  This is correct, but we are
going to change force_sig_info() semantics.  See
http://bugzilla.kernel.org/show_bug.cgi?id=15395#c31

We can use send_sig_info(SEND_SIG_NOINFO) instead, since
614c517d ("signals: SEND_SIG_NOINFO should
be considered as SI_FROMUSER()") SEND_SIG_NOINFO means "from user" and
therefore send_signal() will get the correct from_ancestor_ns = T flag.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NRoland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

13aa9a6b

copy_signal() cleanup: clean thread_group_cputime_init() · 93c59907

由 Veaceslav Falico 提交于 3月 10, 2010

Remove unneeded initializations in thread_group_cputime_init() and in
posix_cpu_timers_init_group().  They are useless after kmem_cache_zalloc()
was used in copy_signal().
Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

93c59907

copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct() · 4dd66e69

由 Veaceslav Falico 提交于 3月 10, 2010

Kill unused functions taskstats_tgid_init() and acct_init_pacct() because
we don't use them anywhere after using kmem_cache_zalloc() in
copy_signal().
Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4dd66e69

copy_signal() cleanup: use zalloc and remove initializations · a56704ef

由 Veaceslav Falico 提交于 3月 10, 2010

Use kmem_cache_zalloc() on signal creation and remove unneeded
initialization lines in copy_signal().
Signed-off-by: NVeaceslav Falico <vfalico@redhat.com>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a56704ef

cgroups: remove events before destroying subsystem state objects · a0a4db54

由 Kirill A. Shutemov 提交于 3月 10, 2010

Events should be removed after rmdir of cgroup directory, but before
destroying subsystem state objects.  Let's take reference to cgroup
directory dentry to do that.
Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a0a4db54

cgroups: fix race between userspace and kernelspace · 4ab78683

由 Kirill A. Shutemov 提交于 3月 10, 2010

Notify userspace about cgroup removing only after rmdir of cgroup
directory to avoid race between userspace and kernelspace.

eventfd are used to notify about two types of event:
 - control file-specific, like crossing memory threshold;
 - cgroup removing.

To understand what really happen, userspace can check if the cgroup still
exists.  To avoid race beetween userspace and kernelspace we have to
notify userspace about cgroup removing only after rmdir of cgroup
directory.
Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4ab78683

cgroup: implement eventfd-based generic API for notifications · 0dea1168

由 Kirill A. Shutemov 提交于 3月 10, 2010

This patchset introduces eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in memory controler to track memory usage.

Output of time(1) on building kernel on tmpfs:

Root cgroup before changes:
	make -j2  506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
	make -j2  507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
	make -j2  507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
	make -j2  507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed):
	make -j2  506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed):
	make -j2  507.55 user 64.08s system 193% cpu 4:55.63 total

This patch:

Introduce the write-only file "cgroup.event_control" in every cgroup.

To register new notification handler you need:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and
  unregister_event() must be defined for the control file;
- write "<event_fd> <control_fd> <args>" to cgroup.event_control.
  Interpretation of args is defined by control file implementation;

eventfd will be woken up by control file implementation or when the
cgroup is removed.

To unregister notification handler just close eventfd.

If you need notification functionality for a control file you have to
implement callbacks register_event() and unregister_event() in the
struct cftype.

[kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dan Malek <dan@embeddedalley.com>
Cc: Vladislav Buzov <vbuzov@embeddedalley.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Alexander Shishkin <virtuoso@slind.org>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0dea1168

cgroups: clean up cgroup_pidlist_find() a bit · b70cc5fd

由 Li Zefan 提交于 3月 10, 2010

Don't call get_pid_ns() before we locate/alloc the ns.
Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Acked-by: NPaul Menage <menage@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b70cc5fd

cgroups: blkio subsystem as module · 67523c48

由 Ben Blum 提交于 3月 10, 2010

Modify the Block I/O cgroup subsystem to be able to be built as a module.
As the CFQ disk scheduler optionally depends on blk-cgroup, config options
in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are
enhanced to support the new module dependency.
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

67523c48

cgroups: subsystem module unloading · cf5d5941

由 Ben Blum 提交于 3月 10, 2010

Provides support for unloading modular subsystems.

This patch adds a new function cgroup_unload_subsys which is to be used
for removing a loaded subsystem during module deletion.  Reference
counting of the subsystems' modules is moved from once (at load time) to
once per attached hierarchy (in parse_cgroupfs_options and
rebind_subsystems) (i.e., 0 or 1).
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

cf5d5941

cgroups: subsystem module loading interface · e6a1105b

由 Ben Blum 提交于 3月 10, 2010

Add interface between cgroups subsystem management and module loading

This patch implements rudimentary module-loading support for cgroups -
namely, a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a
module initcall, and a struct module pointer in struct cgroup_subsys.

Several functions that might be wanted by modules have had EXPORT_SYMBOL
added to them, but it's unclear exactly which functions want it and which
won't.
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e6a1105b

cgroups: revamp subsys array · aae8aab4

由 Ben Blum 提交于 3月 10, 2010

This patch series provides the ability for cgroup subsystems to be
compiled as modules both within and outside the kernel tree.  This is
mainly useful for classifiers and subsystems that hook into components
that are already modules.  cls_cgroup and blkio-cgroup serve as the
example use cases for this feature.

It provides an interface cgroup_load_subsys() and cgroup_unload_subsys()
which modular subsystems can use to register and depart during runtime.
The net_cls classifier subsystem serves as the example for a subsystem
which can be converted into a module using these changes.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear.  Iterations over the array are
modified to handle when subsystems are absent, and the dynamic section of
the array is protected by cgroup_mutex.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 adds a mechanism for unloading modular subsystems, which includes
a more advanced rework of the rudimentary reference counting introduced in
patch 2.

Patch #4 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Part of implementing patches 2 and 4 involved updating css pointers in
each css_set when the module appears or leaves.  In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy).  The subsystem loading and
unloading code therefore should keep in mind the special cases where the
added subsystem is the only one in the dummy cgroup (and therefore all
css_sets need to be linked back into it) and where the removed subsys was
the only one in the dummy cgroup (and therefore all css_sets should be
unlinked from it) - however, as all css_sets always stay attached to the
dummy cgroup anyway, these cases are ignored.  Any fix that addresses this
issue should also make sure these cases are addressed in the subsystem
loading and unloading code.

This patch:

Make subsys[] able to be dynamically populated to support modular
subsystems

This patch reworks the way the subsys[] array is used so that subsystems
can register themselves after boot time, and enables the internals of
cgroups to be able to handle when subsystems are not present or may
appear/disappear.
Signed-off-by: NBen Blum <bblum@andrew.cmu.edu>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aae8aab4

cgroup: introduce coalesce css_get() and css_put() · d7b9fff7

由 Daisuke Nishimura 提交于 3月 10, 2010

Current css_get() and css_put() increment/decrement css->refcnt one by
one.

This patch add a new function __css_get(), which takes "count" as a arg
and increment the css->refcnt by "count".  And this patch also add a new
arg("count") to __css_put() and change the function to decrement the
css->refcnt by "count".

These coalesce version of __css_get()/__css_put() will be used to improve
performance of memcg's moving charge feature later, where instead of
calling css_get()/css_put() repeatedly, these new functions will be used.

No change is needed for current users of css_get()/css_put().
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NPaul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d7b9fff7

cgroup: introduce cancel_attach() · 2468c723

由 Daisuke Nishimura 提交于 3月 10, 2010

Add cancel_attach() operation to struct cgroup_subsys.  cancel_attach()
can be used when can_attach() operation prepares something for the subsys,
but we should rollback what can_attach() operation has prepared if attach
task fails after we've succeeded in can_attach().
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Reviewed-by: NPaul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2468c723

Add generic sys_olduname() · 5cacdb4a

由 Christoph Hellwig 提交于 3月 10, 2010

Add generic implementations of the old and really old uname system calls.
Note that sh only implements sys_olduname but not sys_oldolduname, but I'm
not going to bother with another ifdef for that special case.

m32r implemented an old uname but never wired it up, so kill it, too.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5cacdb4a

improve sys_newuname() for compat architectures · e28cbf22

由 Christoph Hellwig 提交于 3月 10, 2010

On an architecture that supports 32-bit compat we need to override the
reported machine in uname with the 32-bit value.  Instead of doing this
separately in every architecture introduce a COMPAT_UTS_MACHINE define in
<asm/compat.h> and apply it directly in sys_newuname().
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e28cbf22

Add generic sys_ipc wrapper · baed7fc9

由 Christoph Hellwig 提交于 3月 10, 2010

Add a generic implementation of the ipc demultiplexer syscall.  Except for
s390 and sparc64 all implementations of the sys_ipc are nearly identical.

There are slight differences in the types of the parameters, where mips
and powerpc as the only 64-bit architectures with sys_ipc use unsigned
long for the "third" argument as it gets casted to a pointer later, while
it traditionally is an "int" like most other paramters.  frv goes even
further and uses unsigned long for all parameters execept for "ptr" which
is a pointer type everywhere.  The change from int to unsigned long for
"third" and back to "int" for the others on frv should be fine due to the
in-register calling conventions for syscalls (we already had a similar
issue with the generic sys_ptrace), but I'd prefer to have the arch
maintainers looks over this in details.

Except for that h8300, m68k and m68knommu lack an impplementation of the
semtimedop sub call which this patch adds, and various architectures have
gets used - at least on i386 it seems superflous as the compat code on
x86-64 and ia64 doesn't even bother to implement it.

[akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Reviewed-by: NH. Peter Anvin <hpa@zytor.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Acked-by: NJesper Nilsson <jesper.nilsson@axis.com>
Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
Acked-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NKyle McMartin <kyle@mcmartin.ca>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

baed7fc9

11 3月, 2010 4 次提交

sched: Fix pick_next_highest_task_rt() for cgroups · 3d07467b

由 Peter Zijlstra 提交于 3月 10, 2010

Since pick_next_highest_task_rt() already iterates all the cgroups and
is really only interested in tasks, skip over the !task entries.
Reported-by: NDhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: NDhaval Giani <dhaval.giani@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3d07467b

sched: Cleanup: remove unused variable in try_to_wake_up() · ab3b3aa5

由 Dan Carpenter 提交于 3月 06, 2010

We haven't used the "orig_rq" variable since
055a0086 "Fix/add missing update_rq_clock() calls"
Signed-off-by: NDan Carpenter <error27@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: efault@gmx.de
LKML-Reference: <20100306111752.GL4958@bicker>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

ab3b3aa5

rcu: Increase RCU CPU stall timeouts if PROVE_RCU · 007b0924

由 Paul E. McKenney 提交于 3月 05, 2010

CONFIG_PROVE_RCU imposes additional overhead on the kernel, so
increase the RCU CPU stall timeouts in an attempt to allow for
this effect.
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-2-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

007b0924

ftrace: Replace read_barrier_depends() with rcu_dereference_raw() · 3f379b03

由 Paul E. McKenney 提交于 3月 05, 2010

Replace the calls to read_barrier_depends() in
ftrace_list_func() with rcu_dereference_raw() to improve
readability.  The reason that we use rcu_dereference_raw() here
is that removed entries are never freed, instead they are simply
leaked.  This is one of a very few cases where use of
rcu_dereference_raw() is the long-term right answer.  And I
don't yet know of any others.  ;-)
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1267830207-9474-1-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3f379b03

10 3月, 2010 1 次提交

perf: Provide generic perf_sample_data initialization · dc1d628a

由 Peter Zijlstra 提交于 3月 03, 2010

This makes it easier to extend perf_sample_data and fixes a bug on arm
and sparc, which failed to set ->raw to NULL, which can cause crashes
when combined with PERF_SAMPLE_RAW.

It also optimizes PowerPC and tracepoint, because the struct
initialization is forced to zero out the whole structure.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NJean Pihet <jpihet@mvista.com>
Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NDavid S. Miller <davem@davemloft.net>
Cc: Jamie Iles <jamie.iles@picochip.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: stable@kernel.org
LKML-Reference: <20100304140100.315416040@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

dc1d628a

08 3月, 2010 1 次提交

sysfs: Use sysfs_attr_init and sysfs_bin_attr_init on module dynamic attributes · 361795b1

由 Eric W. Biederman 提交于 2月 12, 2010

A little more whack-a-mole annotating the dynamic sysfs attributes.  I
had everything built into my earlier test kernel, and so I missed
these.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

361795b1

OpenHarmony / kernel_linux 上一次同步 3 年多

OpenHarmony / kernel_linux
上一次同步 3 年多