- 21 January 2013: 1 commit

Committed by Rusty Russell
Fix up all callers as they were before, with one change: an unsigned module taints the kernel, but doesn't turn off lockdep. Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

- 11 December 2012: 5 commits

Committed by Mel Gorman
The "mm: sched: numa: Control enabling and disabling of NUMA balancing" patch depends on scheduling debug being enabled, but it's perfectly legitimate to disable automatic NUMA balancing even without this option. This should take care of it. Signed-off-by: NMel Gorman <mgorman@suse.de>

Committed by Mel Gorman
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existence of such a switch was and is very important when debugging problems related to transparent hugepages, and we should have the same for automatic NUMA placement. Signed-off-by: NMel Gorman <mgorman@suse.de>
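As a rough illustration of the kind of boot-time switch this describes (the parameter spelling and the variable names below are assumptions for illustration, not a copy of the actual patch), an early __setup() hook could look like this:

```c
#include <linux/init.h>
#include <linux/string.h>

/* Illustrative only: assumed flag, not the kernel's actual symbol. */
static bool numabalancing_enabled;

static int __init setup_numabalancing(char *str)
{
	if (!str)
		return 0;
	if (!strcmp(str, "enable"))
		numabalancing_enabled = true;	/* force balancing on */
	else if (!strcmp(str, "disable"))
		numabalancing_enabled = false;	/* force balancing off */
	else
		return 0;			/* unrecognised value */
	return 1;
}
__setup("numa_balancing=", setup_numabalancing);
```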

Committed by Mel Gorman
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that, but we at least know if we migrated or not. This patch slows the scanning rate if a page was not migrated as the result of a NUMA hinting fault, up to sysctl_numa_balancing_scan_period_max, which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: NMel Gorman <mgorman@suse.de>
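A minimal sketch of the back-off just described (the field and sysctl names follow this patch series, but the exact code is an assumption):

```c
#include <linux/kernel.h>
#include <linux/sched.h>

/* Sketch only: double the per-task scan period when the hinting fault did not
 * lead to a migration, and snap back to the fast rate when it did. */
static void numa_scan_backoff(struct task_struct *p, bool migrated)
{
	if (!migrated)
		p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
					  p->numa_scan_period * 2);
	else
		p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
}
```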

Committed by Peter Zijlstra
Add a 1 second delay before starting to scan the working set of a task and starting to balance it amongst nodes. [ note that before the constant per task WSS sampling rate patch the initial scan would happen much later still, in effect that patch caused this regression. ] The theory is that short-run tasks benefit very little from NUMA placement: they come and go, and they better stick to the node they were started on. As tasks mature and rebalance to other CPUs and nodes, so does their NUMA placement have to change and so does it start to matter more and more. In practice this change fixes an observable kbuild regression: # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ] !NUMA: 45.291088843 seconds time elapsed ( +- 0.40% ) 45.154231752 seconds time elapsed ( +- 0.36% ) +NUMA, no slow start: 46.172308123 seconds time elapsed ( +- 0.30% ) 46.343168745 seconds time elapsed ( +- 0.25% ) +NUMA, 1 sec slow start: 45.224189155 seconds time elapsed ( +- 0.25% ) 45.160866532 seconds time elapsed ( +- 0.17% ) and it also fixes an observable perf bench (hackbench) regression: # perf stat --null --repeat 10 perf bench sched messaging -NUMA: -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% ) +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% ) +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% ) The implementation is simple and straightforward, most of the patch deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable knob. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> [ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com>

Committed by Peter Zijlstra
NOTE: This patch is based on "sched, numa, mm: Add fault driven placement and migration policy" but as it throws away all the policy to just leave a basic foundation I had to drop the signed-offs-by. This patch creates a bare-bones method for setting PTEs pte_numa in the context of the scheduler that when faulted later will be faulted onto the node the CPU is running on. In itself this does nothing useful but any placement policy will fundamentally depend on receiving hints on placement from fault context and doing something intelligent about it. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com>

- 01 December 2012: 1 commit

Committed by Frederic Weisbecker
Create a new subsystem that probes on kernel boundaries to keep track of the transitions between level contexts with two basic initial contexts: user or kernel. This is an abstraction of some RCU code that use such tracking to implement its userspace extended quiescent state. We need to pull this up from RCU into this new level of indirection because this tracking is also going to be used to implement an "on demand" generic virtual cputime accounting. A necessary step to shutdown the tick while still accounting the cputime. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Reviewed-by: NSteven Rostedt <rostedt@goodmis.org> [ paulmck: fix whitespace error and email address. ] Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>

- 28 November 2012: 1 commit

Committed by Marcelo Tosatti
Originally from Jeremy Fitzhardinge. Acked-by: NIngo Molnar <mingo@redhat.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>

- 20 November 2012: 2 commits

Committed by Eric W. Biederman
The task_user_ns function hides the fact that it is getting the user namespace from struct cred on the task. struct cred may go away as soon as the rcu lock is released. This leads to a race where we can dereference a stale user namespace pointer. To make it obvious a struct cred is involved kill task_user_ns. To kill the race modify the users of task_user_ns to only reference the user namespace while the rcu lock is held. Cc: Kees Cook <keescook@chromium.org> Cc: James Morris <james.l.morris@oracle.com> Acked-by: NKees Cook <keescook@chromium.org> Acked-by: NSerge Hallyn <serge.hallyn@canonical.com> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

Committed by Tejun Heo
Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com>

- 17 November 2012: 1 commit

Committed by Paul E. McKenney
When sched_show_task() is invoked from try_to_freeze_tasks(), there is no RCU read-side critical section, resulting in the following splat: [ 125.780730] =============================== [ 125.780766] [ INFO: suspicious RCU usage. ] [ 125.780804] 3.7.0-rc3+ #988 Not tainted [ 125.780838] ------------------------------- [ 125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage! [ 125.780946] [ 125.780946] other info that might help us debug this: [ 125.780946] [ 125.781031] [ 125.781031] rcu_scheduler_active = 1, debug_locks = 0 [ 125.781087] 4 locks held by s2ram/4211: [ 125.781120] #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160 [ 125.781233] #1: (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160 [ 125.781339] #2: (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230 [ 125.781439] #3: (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0 [ 125.781543] [ 125.781543] stack backtrace: [ 125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988 [ 125.781632] Call Trace: [ 125.781662] [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140 [ 125.781719] [<ffffffff8107cf21>] sched_show_task+0x121/0x180 [ 125.781770] [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0 [ 125.781823] [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80 [ 125.781876] [<ffffffff81090b65>] pm_suspend+0x165/0x230 [ 125.781924] [<ffffffff8108fa29>] state_store+0x99/0x100 [ 125.781975] [<ffffffff812f5867>] kobj_attr_store+0x17/0x20 [ 125.782038] [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160 [ 125.782091] [<ffffffff811667a6>] vfs_write+0xc6/0x180 [ 125.782138] [<ffffffff81166ada>] sys_write+0x5a/0xa0 [ 125.782185] [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 125.782242] [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b This commit therefore adds the needed RCU read-side critical section. Reported-by: N"Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
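The shape of the fix is simply to bracket the flagged dereference with an explicit read-side critical section; the helper below and its placement are illustrative, not the exact diff:

```c
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Illustrative helper: look up the parent pid under rcu_read_lock() so that
 * lockdep no longer sees an unprotected rcu_dereference_check(). */
static pid_t task_ppid_nr_rcu(struct task_struct *p)
{
	pid_t ppid;

	rcu_read_lock();
	ppid = task_pid_nr(rcu_dereference(p->real_parent));
	rcu_read_unlock();

	return ppid;
}
```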

- 24 October 2012: 6 commits

Committed by Paul Turner
While per-entity load-tracking is generally useful beyond computing shares distribution (e.g. runnable-based load-balance (in progress), governors, power-management, etc.), these facilities are not yet consumers of this data. This may be trivially reverted when the information is required; but avoid paying the overhead for calculations we will not use until then. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NBen Segall <bsegall@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141507.422162369@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Paul Turner
Since we are now doing bottom up load accumulation we need explicit notification when a task has been re-parented so that the old hierarchy can be updated. Adds: migrate_task_rq(struct task_struct *p, int next_cpu) (The alternative is to do this out of __set_task_cpu, but it was suggested that this would be a cleaner encapsulation.) Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NBen Segall <bsegall@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.660023400@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Paul Turner
We are currently maintaining: runnable_load(cfs_rq) = \Sum task_load(t) For all running children t of cfs_rq. While this can be naturally updated for tasks in a runnable state (as they are scheduled); this does not account for the load contributed by blocked task entities. This can be solved by introducing a separate accounting for blocked load: blocked_load(cfs_rq) = \Sum runnable(b) * weight(b) Obviously we do not want to iterate over all blocked entities to account for their decay, we instead observe that: runnable_load(t) = \Sum p_i*y^i and that to account for an additional idle period we only need to compute: y*runnable_load(t). This means that we can compute all blocked entities at once by evaluating: blocked_load(cfs_rq)` = y * blocked_load(cfs_rq) Finally we maintain a decay counter so that when a sleeping entity re-awakens we can determine how much of its load should be removed from the blocked sum. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NBen Segall <bsegall@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.585389902@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Paul Turner
Instead of tracking averaging the load parented by a cfs_rq, we can track entity load directly. With the load for a given cfs_rq then being the sum of its children. To do this we represent the historical contribution to runnable average within each trailing 1024us of execution as the coefficients of a geometric series. We can express this for a given task t as: runnable_sum(t) = \Sum u_i * y^i, runnable_avg_period(t) = \Sum 1024 * y^i load(t) = weight_t * runnable_sum(t) / runnable_avg_period(t) Where: u_i is the usage in the last i`th 1024us period (approximately 1ms) ~ms and y is chosen such that y^k = 1/2. We currently choose k to be 32 which roughly translates to about a sched period. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NBen Segall <bsegall@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120823141506.372695337@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
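As a standalone illustration of the formula in this changelog (plain userspace C, not kernel code), the decayed sums and the resulting load can be computed like this:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y chosen so that y^32 == 1/2 */
	double runnable_sum = 0.0, period_sum = 0.0;
	/* toy trace: microseconds run in each of the last 8 periods, newest first */
	double usage[8] = { 1024, 1024, 512, 0, 0, 256, 1024, 1024 };

	for (int i = 0; i < 8; i++) {
		runnable_sum += usage[i] * pow(y, i);	/* \Sum u_i * y^i  */
		period_sum   += 1024.0 * pow(y, i);	/* \Sum 1024 * y^i */
	}

	/* load contribution of an entity with weight 1024 */
	printf("runnable_avg = %.3f  load = %.1f\n",
	       runnable_sum / period_sum, 1024.0 * runnable_sum / period_sum);
	return 0;
}
```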

Committed by Paul E. McKenney
The RCU CPU stall warnings rely on trigger_all_cpu_backtrace() to do NMI-based dump of the stack traces of all CPUs. Unfortunately, a number of architectures do not implement trigger_all_cpu_backtrace(), in which case RCU falls back to just dumping the stack of the running CPU. This is unhelpful in the case where the running CPU has detected that some other CPU has stalled. This commit therefore makes the running CPU dump the stacks of the tasks running on the stalled CPUs. Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>

Committed by Frederic Weisbecker
It's only there to call rcu_user_hooks_switch(). Let's just call rcu_user_hooks_switch() directly, we don't need this function in the middle. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>

- 05 October 2012: 2 commits

Committed by Tang Chen
Once array sched_domains_numa_masks[] []is defined, it is never updated. When a new cpu on a new node is onlined, the coincident member in sched_domains_numa_masks[][] is not initialized, and all the masks are 0. As a result, the build_overlap_sched_groups() will initialize a NULL sched_group for the new cpu on the new node, which will lead to kernel panic: [ 3189.403280] Call Trace: [ 3189.403286] [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0 [ 3189.403289] [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20 [ 3189.403292] [<ffffffff810b1d57>] build_sched_domains+0x467/0x470 [ 3189.403296] [<ffffffff810b2067>] partition_sched_domains+0x307/0x510 [ 3189.403299] [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510 [ 3189.403305] [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90 [ 3189.403308] [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70 [ 3189.403316] [<ffffffff81674b87>] notifier_call_chain+0x67/0x150 [ 3189.403320] [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5 [ 3189.403328] [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10 [ 3189.403333] [<ffffffff81070470>] __cpu_notify+0x20/0x40 [ 3189.403337] [<ffffffff8166663e>] _cpu_up+0xe9/0x131 [ 3189.403340] [<ffffffff81666761>] cpu_up+0xdb/0xee [ 3189.403348] [<ffffffff8165667c>] store_online+0x9c/0xd0 [ 3189.403355] [<ffffffff81437640>] dev_attr_store+0x20/0x30 [ 3189.403361] [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100 [ 3189.403368] [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0 [ 3189.403371] [<ffffffff811ccdb4>] sys_write+0x54/0xa0 [ 3189.403375] [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]--- [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 This patch registers a new notifier for cpu hotplug notify chain, and updates sched_domains_numa_masks every time a new cpu is onlined or offlined. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> [ fixed compile warning ] Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
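The fix described above boils down to a CPU-hotplug notifier of roughly this shape (a sketch: the set/clear helper names follow the changelog's intent but are assumptions here):

```c
#include <linux/cpu.h>
#include <linux/notifier.h>

/* Assumed helpers that add/remove a CPU from sched_domains_numa_masks[][]. */
static void sched_domains_numa_masks_set(unsigned int cpu);
static void sched_domains_numa_masks_clear(unsigned int cpu);

static int sched_domains_numa_masks_update(struct notifier_block *nfb,
					   unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_ONLINE:
		sched_domains_numa_masks_set(cpu);	/* new CPU joins every level's masks */
		break;
	case CPU_DEAD:
		sched_domains_numa_masks_clear(cpu);	/* offlined CPU leaves the masks */
		break;
	}
	return NOTIFY_OK;
}
/* registered e.g. via hotcpu_notifier(sched_domains_numa_masks_update, 0); */
```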

Committed by Tang Chen
We should temporarily reset 'sched_domains_numa_levels' to 0 in sched_init_numa(), and only restore it to 'level' once initialization is complete. If it fails to allocate memory for array sched_domains_numa_masks[][], the array will contain fewer than 'level' members. This could be dangerous when we use it to iterate array sched_domains_numa_masks[][] in other functions. This patch sets sched_domains_numa_levels to 0 before initializing array sched_domains_numa_masks[][], and resets it to 'level' when sched_domains_numa_masks[][] is fully initialized. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
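Schematically, the ordering the patch enforces looks like this (the types and the allocation helper name are assumptions; only the ordering matters):

```c
#include <linux/slab.h>

/* Sketch-level declarations; in the kernel the array holds per-node cpumasks. */
static void **sched_domains_numa_masks;
static int sched_domains_numa_levels;
static void *alloc_level_masks(int level);	/* assumed helper */

static void sched_init_numa_masks(int level)
{
	int i;

	sched_domains_numa_levels = 0;	/* nobody may iterate a partial array */

	sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
	if (!sched_domains_numa_masks)
		return;			/* count stays 0, so iteration stays safe */

	for (i = 0; i < level; i++) {
		sched_domains_numa_masks[i] = alloc_level_masks(i);
		if (!sched_domains_numa_masks[i])
			return;		/* array is partial, but the count is still 0 */
	}

	sched_domains_numa_levels = level;	/* fully initialised: publish last */
}
```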

- 01 October 2012: 1 commit

Committed by Al Viro
Make default just return 0. The current default (checking TIF_POLLING_NRFLAG) is taken to architectures that need it; ones that don't do polling in their idle threads don't need to defined TIF_POLLING_NRFLAG at all. ia64 defined both TS_POLLING (used by its tsk_is_polling()) and TIF_POLLING_NRFLAG (not used at all). Killed the latter... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
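The default in question presumably reduces to something like the following (a sketch; architectures that poll in their idle loop provide their own definition):

```c
/* Generic fallback: the scheduler assumes the idle task is not polling,
 * so a reschedule always needs an IPI. Polling architectures override this. */
#ifndef tsk_is_polling
#define tsk_is_polling(t) 0
#endif
```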

- 26 September 2012: 3 commits

Committed by Frederic Weisbecker
When exceptions or irq are about to resume userspace, if the task needs to be rescheduled, the arch low level code calls schedule() directly. If we call it, it is because we have the TIF_RESCHED flag: - It can be set after random local calls to set_need_resched() (RCU, drm, ...) - A wake up happened and the CPU needs preemption. This can happen in several ways: * Remotely: the remote waking CPU has set TIF_RESCHED and send the wakee an IPI to schedule the new task. * Remotely enqueued: the remote waking CPU sends an IPI to the target and the wake up is made by the target. * Locally: waking CPU == wakee CPU and the wakeup is done locally. set_need_resched() is called without IPI. In the case of local and remotely enqueued wake ups, the tick can be restarted when we enqueue the new task and RCU can exit the extended quiescent state at the same time. Then by the time we reach irq exit path and we call schedule, we are not in RCU user mode. But if we call schedule() only because something called set_need_resched(), RCU may still be in user mode when we reach schedule. Also if a wake up is done remotely, the CPU might see the TIF_RESCHED flag and call schedule while the IPI has not yet happen to restart the tick and exit RCU user mode. We need to manually protect against these corner cases. Create a new API schedule_user() that calls schedule() inside rcu_user_exit()-rcu_user_enter() in order to protect it. Archs will need to rely on it now to implement user preemption safely. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Alessio Igor Bogani <abogani@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kevin Hilman <khilman@ti.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
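A minimal sketch of the new entry point described above, assuming the rcu_user_exit()/rcu_user_enter() hooks named in the changelog:

```c
#include <linux/linkage.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Called from the arch return-to-user path instead of schedule(): leave the
 * userspace extended quiescent state before running scheduler/RCU code,
 * then re-enter it on the way back out to userspace. */
asmlinkage void __sched schedule_user(void)
{
	rcu_user_exit();
	schedule();
	rcu_user_enter();
}
```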

Committed by Frederic Weisbecker
When an exception or an irq exits, and we are going to resume into interrupted kernel code, the low level architecture code calls preempt_schedule_irq() if there is a need to reschedule. If the interrupt/exception occured between a call to rcu_user_enter() (from syscall exit, exception exit, do_notify_resume exit, ...) and a real resume to userspace (iret,...), preempt_schedule_irq() can be called whereas RCU thinks we are in userspace. But preempt_schedule_irq() is going to run kernel code and may be some RCU read side critical section. We must exit the userspace extended quiescent state before we call it. To solve this, just call rcu_user_exit() in the beginning of preempt_schedule_irq(). Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Alessio Igor Bogani <abogani@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kevin Hilman <khilman@ti.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>

Committed by Frederic Weisbecker
Clear the syscalls hook of a task when it's scheduled out so that if the task migrates, it doesn't run the syscall slow path on a CPU that might not need it. Also set the syscalls hook on the next task if needed. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Alessio Igor Bogani <abogani@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Christoph Lameter <cl@linux.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Gilad Ben Yossef <gilad@benyossef.com> Cc: Hakan Akkan <hakanakkan@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kevin Hilman <khilman@ti.com> Cc: Max Krasnyansky <maxk@qualcomm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: NJosh Triplett <josh@joshtriplett.org>

- 25 September 2012: 1 commit

Committed by Frederic Weisbecker
Use a naming based on vtime as a prefix for virtual based cputime accounting APIs: - account_system_vtime() -> vtime_account() - account_switch_vtime() -> vtime_task_switch() It makes it easier to allow for further declension such as vtime_account_system(), vtime_account_idle(), ... if we want to find out the context we account to from generic code. This also make it better to know on which subsystem these APIs refer to. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>

- 23 September 2012: 1 commit

Committed by Peter Zijlstra
Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid miscounting noted by Rakib. ] Reported-by: NRakib Mullick <rakib.mullick@gmail.com> Reported-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
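The folding described above amounts to something like the following sketch (it mirrors the calc_load_migrate() idea mentioned in the changelog; treat it as illustrative rather than the exact diff):

```c
/* After all tasks have been migrated off a dying CPU, fold whatever
 * nr_active delta it still carried into the global calc_load count,
 * instead of trying to migrate nr_uninterruptible between runqueues. */
static void calc_load_migrate(struct rq *rq)
{
	long delta = calc_load_fold_active(rq);

	if (delta)
		atomic_long_add(delta, &calc_load_tasks);
}
```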

- 17 September 2012: 1 commit

Committed by Linus Torvalds
This reverts commit 970e1789. Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20% performance drop on PostgreSQL 9.2 on his machine (running "pgbench"). Borislav Petkov was able to reproduce this, and bisected it to this commit 970e1789 ("sched: Improve scalability via 'CPU buddies' ...") apparently because the new single-idle-buddy model simply doesn't find idle CPU's to reschedule on aggressively enough. Mike Galbraith suspects that it is likely due to the user-mode spinlocks in PostgreSQL not reacting well to preemption, but we don't really know the details - I'll just revert the commit for now. There are hopefully other approaches to improve scheduler scalability without it causing these kinds of downsides. Reported-by: NNikolay Ulyanitsky <lystor@gmail.com> Bisected-by: NBorislav Petkov <bp@alien8.de> Acked-by: NMike Galbraith <efault@gmx.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

- 13 September 2012: 2 commits

Committed by Peter Zijlstra
Commit f319da0c ("sched: Fix load avg vs cpu-hotplug") was an incomplete fix: In particular, the problem is that at the point it calls calc_load_migrate() nr_running := 1 (the stopper thread), so move the call to CPU_DEAD where we're sure that nr_running := 0. Also note that we can call calc_load_migrate() without serialization, we know the state of rq is stable since its cpu is dead, and we modify the global state using appropriate atomic ops. Suggested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
Now that the last architecture to use this has stopped doing so (ARM, thanks Catalin!) we can remove this complexity from the scheduler core. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

- 04 September 2012: 4 commits

Committed by Michael Wang
It's impossible to enter the else branch if we have set skip_clock_update in task_yield_fair(), as yield_to_task_fair() will directly return true after invoking task_yield_fair(). Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com> Acked-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Namhyung Kim
Various sd->*_idx's are used for referring to the rq's load average table when selecting a cpu to run. However, they can be set to any number with sysctl knobs, so a bad value can crash the kernel. Fix it by limiting them to the valid range. Signed-off-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Peter Boonstoppel
migrate_tasks() uses _pick_next_task_rt() to get tasks from the real-time runqueues to be migrated. When rt_rq is throttled _pick_next_task_rt() won't return anything, in which case migrate_tasks() can't move all threads over and gets stuck in an infinite loop. Instead unthrottle rt runqueues before migrating tasks. Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair() Signed-off-by: NPeter Boonstoppel <pboonstoppel@nvidia.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
Rabik and Paul reported two different issues related to the same few lines of code. Rabik's issue is that the nr_uninterruptible migration code is wrong in that he sees artifacts due to this (Rabik please do expand in more detail). Paul's issue is that this code as it stands relies on us using stop_machine() for unplug, we all would like to remove this assumption so that eventually we can remove this stop_machine() usage altogether. The only reason we'd have to migrate nr_uninterruptible is so that we could use for_each_online_cpu() loops in favour of for_each_possible_cpu() loops, however since nr_uninterruptible() is the only such loop and its using possible lets not bother at all. The problem Rabik sees is (probably) caused by the fact that by migrating nr_uninterruptible we screw rq->calc_load_active for both rqs involved. So don't bother with fancy migration schemes (meaning we now have to keep using for_each_possible_cpu()) and instead fold any nr_active delta after we migrate all tasks away to make sure we don't have any skewed nr_active accounting. Reported-by: NRakib Mullick <rakib.mullick@gmail.com> Reported-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>

- 20 August 2012: 2 commits

Committed by Frederic Weisbecker
The archs that implement virtual cputime accounting all flush the cputime of a task when it gets descheduled and sometimes set up some ground initialization for the next task to account its cputime. These archs all put their own hooks in their context switch callbacks and handle the off-case themselves. Consolidate this by creating a new account_switch_vtime() callback called in generic code right after a context switch and that these archs must implement to flush the prev task cputime and initialize the next task cputime related state. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>

Committed by Frederic Weisbecker
Extract cputime code from the giant sched/core.c and put it in its own file. This makes it easier to deal with this particular area and de-bloats core.c a bit more. Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>

- 14 August 2012: 4 commits

Committed by Alex Shi
Since the power-saving code was removed from the scheduler, the implementation in this function is out of service and even pollutes other logic: for example, 'want_sd' never has a chance to be set to '0', which removes the effect of SD_WAKE_AFFINE here. So clean up the obsolete code, including SD_PREFER_LOCAL. Signed-off-by: NAlex Shi <alex.shi@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/5028F431.6000306@intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

Committed by Pekka Enberg
This patch adds a comment on top of the schedule() function to explain to scheduler newbies how the main scheduler function is entered. Acked-by: NRandy Dunlap <rdunlap@xenotime.net> Explained-by: NIngo Molnar <mingo@kernel.org> Explained-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NPekka Enberg <penberg@kernel.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

Committed by Mike Galbraith
With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: NMike Galbraith <efault@gmx.de> Cc: stable@kernel.org # v3.3+ Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

Committed by Stanislaw Gruszka
On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Cc: stable@vger.kernel.org Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
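A sketch of the safe scaling this implies: keep the divisor 64-bit and use a genuine 64-bit division rather than do_div() with a (u32) cast (the helper name below is illustrative, not the patch's exact code):

```c
#include <linux/math64.h>
#include <linux/types.h>

/* Scale utime by rtime/total without truncating a 64-bit total to u32;
 * a total whose low 32 bits are zero then no longer divides by zero. */
static u64 scale_cputime(u64 utime, u64 rtime, u64 total)
{
	u64 tmp = utime * rtime;	/* note: may overflow for extreme values */

	return total ? div64_u64(tmp, total) : rtime;
}
```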

- 26 July 2012: 2 commits

Committed by Andrew Vagin
Otherwise they can't be filtered for a defined task: perf record -e sched:sched_switch ./foo This command doesn't report any events without this patch. I think it isn't a security concern if someone knows who will be executed next - this can already be observed by polling /proc state. By default perf is disabled for non-root users in any case. I need these events for profiling sleep times. sched_switch is used for getting callchains and sched_stat_* is used for getting time periods. These events are combined in user space, then it can be analyzed by perf tools. Signed-off-by: NAndrew Vagin <avagin@openvz.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Arun Sharma <asharma@fb.com> Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

Committed by Namhyung Kim
It seems there's no specific reason to open-code it. I guess commit 0122ec5b ("sched: Add p->pi_lock to task_rq_lock()") simply missed it. Let's be consistent with others. Signed-off-by: NNamhyung Kim <namhyung@kernel.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1341647342-6742-1-git-send-email-namhyung@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>