Commits · 80628ca06c5d42929de6bc22c0a41589a834d151 · OpenHarmony / kernel_linux

04 Jul, 2013 10 commits

kernel/fork.c:copy_process(): unify CLONE_THREAD-or-thread_group_leader code · 80628ca0

Authored by Oleg Nesterov on Jul 03, 2013

Cleanup and preparation for the next changes.

Move the "if (clone_flags & CLONE_THREAD)" code down under "if
(likely(p->pid))" and turn it into the "else" branch.  This makes the
process/thread initialization more symmetrical and removes one check.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Sergey Dyasly <dserrg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

80628ca0

fork: reorder permissions when violating number of processes limits · b57922b6

Authored by Eric Paris on Jul 03, 2013

When a task is attempting to violate the RLIMIT_NPROC limit we have a
check to see if the task is sufficiently privileged. The check first
looks at CAP_SYS_ADMIN, then CAP_SYS_RESOURCE, then if the task is uid=0.

A result is that tasks which are allowed by the uid=0 check are first
checked against the security subsystem. This results in the security
subsystem auditing a denial for sys_admin and sys_resource and then the
task passing the uid=0 check.

This patch rearranges the code to first check uid=0, since if we pass that
we shouldn't hit the security system at all. We then check sys_resource,
since it is the smallest capability which will solve the problem. Lastly
we check the catch-all fallback, CAP_SYS_ADMIN. We don't want to grant this
capability in many places since it is so powerful.

This will eliminate many of the false positive/needless denial messages we
get when a root task tries to violate the nproc limit. (Note that
kthreads count against root, so on a sufficiently large machine we can
actually get past the default limits before any userspace tasks are
launched.)

Signed-off-by: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b57922b6
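
The reordering described in b57922b6 can be sketched as a small user-space model. This is not the kernel code: the task dictionary, the `may_exceed_nproc` name, and the audit list are hypothetical stand-ins, and the only thing taken from the commit message is the order of the checks (uid=0, then CAP_SYS_RESOURCE, then CAP_SYS_ADMIN).

```python
# Hypothetical model of the RLIMIT_NPROC override check; not kernel code.

def may_exceed_nproc(task, audit_log):
    """Return True if the task may exceed RLIMIT_NPROC.

    Checks run in the reordered sequence: uid == 0 first, then
    CAP_SYS_RESOURCE, then CAP_SYS_ADMIN. Only the capability checks
    reach the (modelled) security subsystem and can be audited.
    """
    if task["uid"] == 0:
        return True  # no capability check happens, so nothing is audited

    for cap in ("CAP_SYS_RESOURCE", "CAP_SYS_ADMIN"):
        if cap in task["caps"]:
            return True
        audit_log.append(f"denied {cap}")  # modelled audit denial
    return False


log = []
root_without_caps = {"uid": 0, "caps": set()}
allowed = may_exceed_nproc(root_without_caps, log)
```

With the old order (capabilities first), the same root task would have produced two audit denials before passing on uid=0; in this model the log stays empty.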

exit.c: unexport __set_special_pids() · 81dabb46

Authored by Oleg Nesterov on Jul 03, 2013

Move __set_special_pids() from exit.c to sys.c close to its single caller
and make it static.

Also rename it to set_special_pids(); another helper with this name has
gone away.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

81dabb46

usermodehelper: kill the sub_info->path[0] check · 7f57cfa4

Authored by Oleg Nesterov on Jul 03, 2013

call_usermodehelper_exec() does nothing and returns success if path[0] ==
0.  The only user which needs this strange feature is request_module(); it
can check modprobe_path[0] itself like other users do if they want to
detect the "disabled by admin" case.

Kill it.  Not only does it look strange, it can confuse other callers.  And
this allows us to revert 264b83c0 ("usermodehelper: check
subprocess_info->path != NULL"); do_execve(NULL) is safe.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7f57cfa4

ptrace: add ability to get/set signal-blocked mask · 29000cae

Authored by Andrey Vagin on Jul 03, 2013

crtools uses parasite code for dumping processes.  The parasite code is
injected into a process with the help of PTRACE_SEIZE.

Currently crtools blocks signals from the parasite code.  If a process has
pending signals, crtools waits while the process handles these signals.

This method is not suitable for stopped tasks.  A stopped task can have a
few pending signals.  When we try to execute the parasite code, we need to
drop SIGSTOP, but all other signals must remain pending, because the
state of a process must not change during checkpointing.

This patch adds two ptrace commands to set/get the signal-blocked mask.

I think gdb can use these commands too.

[akpm@linux-foundation.org: be consistent with brace layout]

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

29000cae

kprobes: handle empty/invalid input to debugfs "enabled" file · 10fb46d5

Authored by Mathias Krause on Jul 03, 2013

When writing invalid input to 'debug/kprobes/enabled' it'll silently be
ignored.  Even worse, when writing an empty string to this file, the
outcome is purely random as the switch statement will make its decision
based on the value of an uninitialized stack variable.

Fix this by treating invalid/empty input as an error and returning -EINVAL.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

10fb46d5
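
The behaviour that 10fb46d5 describes can be illustrated with a minimal user-space model of a write handler. The function name `write_enabled`, the `state` dictionary, and the accepted characters (`y`/`Y`/`1` and `n`/`N`/`0`) are assumptions made for this sketch; the commit message only states that invalid or empty input must yield -EINVAL.

```python
import errno


def write_enabled(data: bytes, state: dict) -> int:
    """Model of a debugfs 'enabled' write handler.

    Returns the number of bytes consumed, or -EINVAL for empty or
    unrecognised input. The decision is made on the first byte only.
    """
    first = data[:1]  # empty bytes object when nothing was written
    if first in (b"y", b"Y", b"1"):
        state["enabled"] = True
    elif first in (b"n", b"N", b"0"):
        state["enabled"] = False
    else:
        # Covers both the empty write and any unknown character, so no
        # decision is ever based on uninitialised data.
        return -errno.EINVAL
    return len(data)


state = {"enabled": True}
empty_result = write_enabled(b"", state)
bad_result = write_enabled(b"maybe\n", state)
off_result = write_enabled(b"0\n", state)
```

The point of the sketch is the explicit default branch: every input either changes the state or is rejected, and nothing falls through silently.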

kernel/sys.c:do_sysinfo(): use get_monotonic_boottime() · 45c64940

Authored by Oleg Nesterov on Jul 03, 2013

Change do_sysinfo() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

45c64940

kernel/sys.c: sys_reboot(): fix malformed panic message · 7ec75e1c

Authored by liguang on Jul 03, 2013

If the LINUX_REBOOT_CMD_HALT reboot command fails, the message "cannot halt"
stays on the same line as the next message, so append a '\n'.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7ec75e1c

drivers: avoid parsing names as kthread_run() format strings · f170168b

Authored by Kees Cook on Jul 03, 2013

Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f170168b
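
The hazard that f170168b guards against can be shown with an analogy in Python's printf-style `%` operator. This is only a model: `thread_name` is a hypothetical helper standing in for a name-format parameter such as the one kthread_run() takes, and where Python raises an exception, C would read arguments that were never passed.

```python
def thread_name(namefmt, *args):
    """Model of a printf-style name parameter: format first, then args."""
    return namefmt % args


device = "disk%d"  # dynamic content that happens to contain a conversion

# Unsafe: the dynamic string itself is interpreted as the format string.
try:
    thread_name(device)
    unsafe_ok = True
except TypeError:
    unsafe_ok = False  # Python reports the missing argument

# Safe: a constant "%s" format, with the dynamic string passed as data.
safe = thread_name("%s", device)
```

Passing the dynamic string as an argument to a fixed `"%s"` format keeps any `%` characters in it inert, which is the pattern the commit applies to the affected callers.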

mm: use totalram_pages instead of num_physpages at runtime · 0ed5fd13

Authored by Jiang Liu on Jul 03, 2013

The global variable num_physpages is scheduled to be removed, so use
totalram_pages instead of num_physpages at runtime.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

0ed5fd13

30 Jun, 2013 2 commits

cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy · c7ba8287

Authored by Tejun Heo on Jun 29, 2013

0ce6cba3 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
comparing mount options") only updated the remount path, but
CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
while mounting an existing hierarchy.  As an option mismatch triggers a
warning but doesn't fail the mount without sane_behavior, this only
triggers a spurious warning message.

Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
and existing root options.

Signed-off-by: Tejun Heo <tj@kernel.org>

c7ba8287

Fix: kernel/ptrace.c: ptrace_peek_siginfo() missing __put_user() validation · 706b23bd

Authored by Mathieu Desnoyers on Jun 28, 2013

This __put_user() could be used by unprivileged processes to write into
kernel memory.  The issue here is that even if copy_siginfo_to_user()
fails, the error code is not checked before __put_user() is executed.

Luckily, ptrace_peek_siginfo() was added within the 3.10-rc cycle,
so it has not hit a stable release yet.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

706b23bd

28 Jun, 2013 8 commits

softirq: Use _RET_IP_ · d2e08473

Authored by Davidlohr Bueso on Apr 30, 2013

Use the already defined macro to pass the function return address.

Signed-off-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/r/1367347569.1784.3.camel@buesod1.americas.hpqcorp.net
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

d2e08473

sched/debug: Remove CONFIG_FAIR_GROUP_SCHED mask · 333bb864

Authored by Alex Shi on Jun 28, 2013

Now that we are using runnable load avg in sched balance, we don't
need to keep it under CONFIG_FAIR_GROUP_SCHED.

Also align the code style to #ifdef instead of #if defined() and
reorder the tg output info.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Cc: pjt@google.com
Cc: kamalesh@linux.vnet.ibm.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1372417835-4698-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

333bb864

genirq: Add the generic chip to the genirq docbook · ccc414f8

Authored by Thomas Gleixner on Jun 28, 2013

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Randy Dunlap <rdunlap@infradead.org>

ccc414f8

genirq: generic-chip: Export some irq_gc_ functions · d55f0cc4

Authored by Fabio Estevam on Jun 28, 2013

When building imx_v6_v7_defconfig with imx-drm drivers selected as
modules, we get the following build errors:

ERROR: "irq_gc_mask_clr_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_mask_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!
ERROR: "irq_gc_ack_set_bit" [drivers/staging/imx-drm/ipu-v3/imx-ipu-v3.ko] undefined!

Export the required functions to avoid this problem.

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: shawn.guo@linaro.org
Cc: kernel@pengutronix.de
Link: http://lkml.kernel.org/r/1372389789-7048-1-git-send-email-festevam@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

d55f0cc4

genirq: Fix can_request_irq() for IRQs without an action · 2779db8d

Authored by Ben Hutchings on Jun 28, 2013

Commit 02725e74 ('genirq: Use irq_get/put functions')
inadvertently changed can_request_irq() to return 0 for IRQs that have
no action.  This causes pcibios_lookup_irq() to select only IRQs that
already have an action with IRQF_SHARED set, or to fail if there are
none.  Change can_request_irq() to return 1 for IRQs that have no
action (if the first two conditions are met).

Reported-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is>
Tested-by: Bjarni Ingi Gislason <bjarniig@rhi.hi.is> (against 3.2)
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: 709647@bugs.debian.org
Cc: stable@vger.kernel.org # 2.6.39+
Link: http://bugs.debian.org/709647
Link: http://lkml.kernel.org/r/1372383630.23847.40.camel@deadeye.wl.decadent.org.uk
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

2779db8d
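
The decision logic that 2779db8d restores can be sketched as follows. This is a user-space model, not the kernel function: the descriptor is a plain dictionary, the value of `IRQF_SHARED` is illustrative, and "the first two conditions" are assumed here to mean that the descriptor exists and is requestable.

```python
IRQF_SHARED = 0x80  # illustrative flag bit for this sketch


def can_request_irq(desc, irqflags):
    """Return 1 if the IRQ could be requested with irqflags, else 0.

    desc is None for a missing descriptor; otherwise a dict with
    'requestable' (bool) and 'action_flags' (int, or None when no
    action is installed).
    """
    if desc is None or not desc["requestable"]:
        return 0
    action_flags = desc["action_flags"]
    if action_flags is None:
        return 1  # no action installed: the IRQ is free to request
    # An action exists: both the existing action and the new request
    # must carry the shared flag.
    return 1 if (irqflags & action_flags & IRQF_SHARED) else 0


free_irq = {"requestable": True, "action_flags": None}
shared_irq = {"requestable": True, "action_flags": IRQF_SHARED}
exclusive_irq = {"requestable": True, "action_flags": 0}
```

The regression described in the commit corresponds to the `action_flags is None` branch returning 0, which left only already-shared IRQs selectable.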

sched/debug: Fix formatting of /proc/<PID>/sched · add332a1

Authored by Kamalesh Babulal on Jun 27, 2013

This patch alters the format string's width to align all statistics
with the longest struct sched_statistic member name under
/proc/<PID>/sched.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20130627165005.GA15583@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

add332a1

cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options · 0ce6cba3

Authored by Tejun Heo on Jun 27, 2013

1672d040 ("cgroup: fix cgroupfs_root early destruction path")
introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
subsys binding on a new root; however, this broke remounts.
cgroup_remount() doesn't allow changing root options via remount and
CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
makes the function reject all remounts.

Fix it by putting the options part in the lower 16 bits of root->flags
and masking the comparisons.  While at it, make cgroup_remount() emit
an error message explaining why it's rejecting a remount request, so
that it's less of a mystery.

Signed-off-by: Tejun Heo <tj@kernel.org>

0ce6cba3
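
The masked comparison that 0ce6cba3 describes can be sketched with plain integers. The layout follows the commit message (options in the lower 16 bits of the flags word), but the exact bit position of `CGRP_ROOT_SUBSYS_BOUND` and the two option bits below are assumptions chosen for illustration.

```python
CGRP_ROOT_OPTION_MASK = (1 << 16) - 1  # lower 16 bits hold mount options
CGRP_ROOT_SUBSYS_BOUND = 1 << 16       # internal state bit (position assumed)

OPT_A = 1 << 1  # hypothetical mount option bits
OPT_B = 1 << 2


def options_match_unmasked(new_flags, root_flags):
    """Comparison before the fix: internal state bits leak in."""
    return new_flags == root_flags


def options_match(new_flags, root_flags):
    """Comparison after the fix: only the option bits are compared."""
    return ((new_flags ^ root_flags) & CGRP_ROOT_OPTION_MASK) == 0


# A fully initialised root carries the internal bit; a remount request
# parsed from mount options does not.
root_flags = OPT_A | CGRP_ROOT_SUBSYS_BOUND
request_flags = OPT_A
```

In this model the unmasked comparison rejects a remount whose options are identical, which mirrors the "reject all remounts" symptom; the masked one accepts it and still detects real option changes.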

cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts() · e2bd416f

Authored by Tejun Heo on Jun 27, 2013

eb178d06 ("cgroup: grab cgroup_mutex in
drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
grab cgroup_mutex to make lockdep assertion in for_each_subsys()
happy.  Unfortunately, cgroup_remount() calls the function while
holding cgroup_mutex in its failure path leading to the following
deadlock.

# mount -t cgroup -o remount,memory,blkio cgroup blkio

 cgroup: option changes via remount are deprecated (pid=525 comm=mount)

 =============================================
 [ INFO: possible recursive locking detected ]
 3.10.0-rc4-work+ #1 Not tainted
 ---------------------------------------------
 mount/525 is trying to acquire lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0

 but task is already holding lock:
  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200

 other info that might help us debug this:
  Possible unsafe locking scenario:

	CPU0
	----
   lock(cgroup_mutex);
   lock(cgroup_mutex);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 4 locks held by mount/525:
  #0:  (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
  #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
  #2:  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
  #3:  (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200

 stack backtrace:
 CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
  ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
  ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
 Call Trace:
  [<ffffffff81c24bb1>] dump_stack+0x19/0x1b
  [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
  [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
  [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
  [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
  [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
  [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
  [<ffffffff811e9d41>] do_mount+0x5f1/0xa30
  [<ffffffff811ea203>] SyS_mount+0x83/0xc0
  [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b

Fix it by moving the drop_parsed_module_refcounts() invocation outside
cgroup_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>

e2bd416f
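
The self-deadlock in e2bd416f follows a general pattern: a helper takes a non-recursive lock that its caller already holds. A minimal user-space sketch, with hypothetical function names and a non-blocking acquire so that the model reports the problem instead of hanging:

```python
import threading

cgroup_mutex = threading.Lock()  # non-recursive, like a kernel mutex


def drop_refs():
    """Helper that takes the mutex itself.

    A blocking acquire would hang forever if the caller already held
    the lock; the non-blocking attempt lets this sketch report it.
    """
    if not cgroup_mutex.acquire(blocking=False):
        return "would deadlock"
    try:
        return "ok"
    finally:
        cgroup_mutex.release()


def remount_buggy():
    with cgroup_mutex:
        return drop_refs()  # helper called while the mutex is held


def remount_fixed():
    with cgroup_mutex:
        pass                # work that needs the mutex
    return drop_refs()      # helper invoked after the mutex is dropped


buggy_result = remount_buggy()
fixed_result = remount_fixed()
```

Moving the helper call outside the locked region, as the commit does for the failure path of cgroup_remount(), removes the recursive acquisition without changing what the helper does.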

27 Jun, 2013 18 commits

sched/fair: Fix typo describing flags in enqueue_entity · 0fc576d5

Authored by Kamalesh Babulal on Jun 27, 2013

Fix spelling of 'calling' in description of se flags in
enqueue_entity().

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20130627055418.GA18582@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

0fc576d5

sched/debug: Add load-tracking statistics to task · 939fd731

Authored by Kamalesh Babulal on Jun 25, 2013

At present we print per-entity load-tracking statistics for
cfs_rq of cgroups/runqueues. Given that per-task statistics
are maintained, they can be used to determine the contribution
a task makes to its parent cfs_rq level.

This patch adds per-task load-tracking statistics to /proc/<PID>/sched.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20130625080336.GA20175@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

939fd731

sched: Change get_rq_runnable_load() to static and inline · a9dc5d0e

Authored by Alex Shi on Jun 20, 2013

Based-on-patch-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-14-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a9dc5d0e

sched/tg: Remove tg.load_weight · a9cef46a

Authored by Alex Shi on Jun 20, 2013

Since no one uses it.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-13-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a9cef46a

sched/cfs_rq: Change atomic64_t removed_load to atomic_long_t · 2509940f

Authored by Alex Shi on Jun 20, 2013

As with the runnable_load_avg and blocked_load_avg variables, a long is
enough for removed_load on both 64-bit and 32-bit machines.

This avoids the expensive atomic64 operations on 32-bit machines.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-12-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2509940f

sched/tg: Use 'unsigned long' for load variable in task group · bf5b986e

Authored by Alex Shi on Jun 20, 2013

Since tg->load_avg is smaller than tg->load_weight, we don't need an
atomic64_t variable for load_avg on 32-bit machines.
The same reasoning applies to cfs_rq->tg_load_contrib.

The atomic_long_t/unsigned long variable types are more efficient and
convenient for them.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-11-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

bf5b986e

sched: Change cfs_rq load avg to unsigned long · 72a4cf20

Authored by Alex Shi on Jun 20, 2013

Since the 'u64 runnable_load_avg, blocked_load_avg' in the cfs_rq struct are
smaller than the 'unsigned long' cfs_rq->load.weight, we don't need u64
variables to describe them. unsigned long is more efficient and convenient.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-10-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

72a4cf20

sched: Consider runnable load average in move_tasks() · a003a25b

Authored by Alex Shi on Jun 20, 2013

Aside from using runnable load average in the background, move_tasks is
also the key function in load balance. We need to consider the runnable
load average in it in order to make it an apples-to-apples load
comparison.

Morten caught a div u64 bug on ARM, thanks!

Thanks-to: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-8-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a003a25b

sched: Compute runnable load avg in cpu_load and cpu_avg_load_per_task · b92486cb

Authored by Alex Shi on Jun 20, 2013

They are the base values in load balance. Update them with the rq runnable
load average, so that load balance considers the runnable load avg
naturally.

We also tried including blocked_load_avg as cpu load in balancing,
but that caused kbuild performance to drop 6% on every Intel machine, and
aim7/oltp to drop on some 4-socket machines.
Adding blocked_load_avg only to get_rq_runnable_load() still made hackbench
drop a little on NHM EX.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-7-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

b92486cb

sched: Update cpu load after task_tick · 83dfd523

Authored by Alex Shi on Jun 20, 2013

To get the latest runnable info, we need to do this cpuload update after
task_tick.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-6-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

83dfd523

sched: Fix sleep time double accounting in enqueue entity · 282cf499

Authored by Alex Shi on Jun 20, 2013

A woken, migrated task calls __synchronize_entity_decay(se) in
migrate_task_rq_fair(), so it then needs to set
`se->avg.last_runnable_update -= (-se->avg.decay_count) << 20' before
update_entity_load_avg(), to avoid the sleep time being applied twice
to se.avg.load_avg_contrib, once in __synchronize_entity_decay() and
once in update_entity_load_avg().

However, if the sleeping task is woken up on the same cpu, it misses
the last_runnable_update adjustment before update_entity_load_avg(se, 0, 1),
so the sleep time is accounted in both functions.  So we need to remove
the double sleep time accounting.

Paul also contributed some code comments in this commit.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-5-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

282cf499

sched: Set an initial value of runnable avg for new forked task · a75cdaa9

Authored by Alex Shi on Jun 20, 2013

We need to initialize the se.avg.{decay_count, load_avg_contrib} for a
newly forked task. Otherwise random values of the above variables cause a
mess when a new task is enqueued:

    enqueue_task_fair
        enqueue_entity
            enqueue_entity_load_avg

and make fork balancing imbalanced due to incorrect load_avg_contrib.

Furthermore, Morten Rasmussen noticed that some tasks were not launched
immediately after being created. So Paul and Peter suggested giving the
new task's runnable avg time a start value equal to sched_slice().

PeterZ said:

> So the 'problem' is that our running avg is a 'floating' average; ie. it
> decays with time. Now we have to guess about the future of our newly
> spawned task -- something that is nigh impossible seeing these CPU
> vendors keep refusing to implement the crystal ball instruction.
>
> So there's two asymptotic cases we want to deal well with; 1) the case
> where the newly spawned program will be 'nearly' idle for its lifetime;
> and 2) the case where its cpu-bound.
>
> Since we have to guess, we'll go for worst case and assume its
> cpu-bound; now we don't want to make the avg so heavy adjusting to the
> near-idle case takes forever. We want to be able to quickly adjust and
> lower our running avg.
>
> Now we also don't want to make our avg too light, such that it gets
> decremented just for the new task not having had a chance to run yet --
> even if when it would run, it would be more cpu-bound than not.
>
> So what we do is we make the initial avg of the same duration as that we
> guess it takes to run each task on the system at least once -- aka
> sched_slice().
>
> Of course we can defeat this with wakeup/fork bombs, but in the 'normal'
> case it should be good enough.

Paul also contributed most of the code comments in this commit.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reviewed-by: Paul Turner <pjt@google.com>
[peterz: added explanation of sched_slice() usage]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-4-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a75cdaa9

sched: Move a few runnable tg variables into CONFIG_SMP · fa6bddeb

Authored by Alex Shi on Jun 20, 2013

The following 2 variables are only used under CONFIG_SMP, so it's
better to move their definition into CONFIG_SMP too.

        atomic64_t load_avg;
        atomic_t runnable_avg;

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1371694737-29336-3-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

fa6bddeb

Revert "sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking" · 141965c7

Authored by Alex Shi on Jun 26, 2013

Remove the CONFIG_FAIR_GROUP_SCHED guard that covers the runnable info, so
that we can use the runnable load variables.

Also remove 2 CONFIG_FAIR_GROUP_SCHED settings which are not in the reverted
patch (they were introduced in 9ee474f5) but also need to be reverted.

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/51CA76A3.3050207@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

141965c7

cgroup: always use RCU accessors for protected accesses · a4ea1cc9

Authored by Tejun Heo on Jun 21, 2013

kernel/cgroup.c still has places where an RCU pointer is set and
accessed directly without going through RCU_INIT_POINTER() or
rcu_dereference_protected().  They're all properly protected accesses
so nothing is broken but it leads to spurious sparse RCU address space
warnings.

Substitute direct accesses with RCU_INIT_POINTER() and
rcu_dereference_protected().  Note that %true is specified as the
extra condition for all dereference updates.  This isn't ideal as all
it does is suppress the warning without actually policing
synchronization rules; however, most are scheduled to be removed
pretty soon along with css_id itself, so no reason to be more
elaborate.

Combined with the previous changes, this removes all RCU related
sparse warnings from cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>

a4ea1cc9

cgroup: fix RCU accesses around task->cgroups · a8ad805c

Authored by Tejun Heo on Jun 21, 2013

There are several places in kernel/cgroup.c where task->cgroups is
accessed and modified without going through proper RCU accessors.
None is broken as they're all lock protected accesses; however, this
still triggers sparse RCU address space warnings.

* Consistently use task_css_set() for task->cgroups dereferencing.

* Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
  exit.

* Remove unnecessary rcu_dereference_raw() from cset->subsys[]
  dereference in cgroup_exit().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Li Zefan <lizefan@huawei.com>

a8ad805c

cgroup: grab cgroup_mutex in drop_parsed_module_refcounts() · eb178d06

Authored by Tejun Heo on Jun 25, 2013

This isn't strictly necessary as all subsystems specified in
@subsys_mask are guaranteed to be pinned; however, it does spuriously
trigger a lockdep warning.  Let's grab cgroup_mutex around it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

eb178d06

cgroup: fix cgroupfs_root early destruction path · 1672d040

Authored by Tejun Heo on Jun 25, 2013

cgroupfs_root used to have ->actual_subsys_mask in addition to
->subsys_mask.  a8a648c4 ("cgroup: remove
cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
essentially temporary and doesn't belong in cgroupfs_root; however,
the patch made it impossible to tell whether a cgroupfs_root actually
has the subsystems bound or just has the bits set, leading to the
following BUG when trying to mount with subsystems which are already
mounted elsewhere.

 kernel BUG at kernel/cgroup.c:1038!
 invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
 ...
 CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
 task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
 RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
 ...
 Call Trace:
  [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
  [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
  [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
  [<ffffffff81257169>] cpuset_mount+0xd9/0x110
  [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
  [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
  [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
  [<ffffffff814085d6>] do_mount+0x356/0x3c0
  [<ffffffff8140873d>] SyS_mount+0xfd/0x140
  [<ffffffff854eb600>] tracesys+0xdd/0xe2

We still want rebind_subsystems() to take added/removed masks, so
let's fix it by marking whether a cgroupfs_root has finished binding
or not.  Also, document what's going on around ->subsys_mask
initialization so that similar mistakes aren't repeated.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Acked-by: Li Zefan <lizefan@huawei.com>

1672d040

26 Jun, 2013 2 commits

mutex: Add w/w mutex slowpath debugging · 23010027

Authored by Daniel Vetter on Jun 20, 2013

Injects EDEADLK conditions at pseudo-random interval, with
exponential backoff up to UINT_MAX (to ensure that every lock
operation still completes in a reasonable time).

This way we can test the wound slowpath even for ww mutex users
where contention is never expected, and the ww deadlock
avoidance algorithm is only needed for correctness against
malicious userspace. An example would be protecting kernel
modesetting properties, which thanks to single-threaded X isn't
really expected to contend, ever.

I've looked into using the CONFIG_FAULT_INJECTION
infrastructure, but decided against it for two reasons:

- EDEADLK handling is mandatory for ww mutex users and should
  never affect the outcome of a syscall. This is in contrast to -ENOMEM
  injection. So fine configurability isn't required.

- The fault injection framework only allows setting a simple
  probability for failure. Now the probability that a ww mutex acquire
  stage with N locks will never complete (due to too many injected
  EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
  operations for the completely uncontended case would be O(exp(N)).
  The per-acquire ctx exponential backoff solution chosen here only
  results in O(log N) overhead due to injection and so O(log N * N)
  lock operations. This way we can fail with high probability (and so
  have good test coverage even for fancy backoff and lock acquisition
  paths) without running into pathological cases.

Note that EDEADLK will only ever be injected when we managed to
acquire the lock. This prevents any behaviour changes for users
which rely on the EALREADY semantics.

Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
Signed-off-by: Ingo Molnar <mingo@kernel.org>

23010027
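
The per-context exponential backoff that 23010027 describes can be sketched in user space. The class and function names are hypothetical, the interval is doubled for simplicity (the actual growth factor is not stated in the commit message), and the starting interval is fixed here where the real code is described as pseudo-random.

```python
import errno

UINT_MAX = 2**32 - 1


class AcquireCtx:
    """Per-acquire context carrying the injection state."""

    def __init__(self, first_interval=1):
        self.interval = first_interval
        self.countdown = first_interval


def maybe_inject_deadlock(ctx):
    """Call after a lock was acquired; returns 0 or -EDEADLK."""
    if ctx.countdown > 0:
        ctx.countdown -= 1
        return 0
    # Countdown expired: grow the interval (capped) and inject a fault.
    ctx.interval = min(ctx.interval * 2, UINT_MAX)
    ctx.countdown = ctx.interval
    return -errno.EDEADLK


def injected_positions(n_locks):
    """Indices of the lock operations that get an injected -EDEADLK."""
    ctx = AcquireCtx()
    return [i for i in range(n_locks) if maybe_inject_deadlock(ctx) != 0]


positions = injected_positions(20)
count_in_1000 = len(injected_positions(1000))
```

Because the gap between injections keeps growing, the number of injected failures over N lock operations grows only logarithmically in this model, which is the property the commit message contrasts with a fixed failure probability.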

mutex: Add support for wound/wait style locks · 040a0a37

Authored by Maarten Lankhorst on Jun 24, 2013

Wound/wait mutexes are used when multiple lock
acquisitions of a similar type can be done in an arbitrary
order. The deadlock handling used here is called wait/wound in
the RDBMS literature: The older task waits until it can acquire
the contended lock. The younger task needs to back off and drop
all the locks it is currently holding, i.e. the younger task is
wounded.

For full documentation please read Documentation/ww-mutex-design.txt.

References: https://lwn.net/Articles/548909/
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@canonical.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Rob Clark <robdclark@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Cc: rostedt@goodmis.org
Cc: daniel@ffwll.ch
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

040a0a37
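
The wait/wound rule in 040a0a37 can be illustrated with a toy model in which each task has a numeric stamp and a lower stamp means an older task. Everything here is hypothetical (the `try_lock_all` function, the `owners` dictionary, the string results); it is not the ww_mutex API, and it assumes distinct stamps and that a task never re-locks a lock it already holds.

```python
def try_lock_all(stamp, wanted, owners):
    """Try to take every lock in `wanted` on behalf of task `stamp`.

    owners maps lock name -> stamp of the current holder.
    Returns ("acquired", None), ("wait", lock) or ("backoff", lock).
    """
    held = []
    for name in wanted:
        owner = owners.get(name)
        if owner is None:
            owners[name] = stamp  # uncontended: take the lock
            held.append(name)
        elif stamp < owner:
            # We are older than the holder: keep our locks and wait.
            return ("wait", name)
        else:
            # We are younger: we are wounded and drop everything we hold.
            for h in held:
                del owners[h]
            return ("backoff", name)
    return ("acquired", None)


owners_young = {"b": 1}  # lock "b" is held by an older task (stamp 1)
young_result = try_lock_all(2, ["a", "b"], owners_young)

owners_old = {"b": 5}    # lock "b" is held by a younger task (stamp 5)
old_result = try_lock_all(2, ["a", "b"], owners_old)
```

In the first case the younger requester releases lock "a" again before reporting the backoff; in the second the older requester keeps "a" and simply waits for "b". Breaking ties by age is what lets locks be taken in an arbitrary order without deadlocking.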

OpenHarmony / kernel_linux · last synced more than 3 years ago
