提交 · 8cb9764fc88b41db11f251e8b2a0d006578b7eb4 · openeuler / raspberrypi-kernel

07 5月, 2015 2 次提交

nohz: Set isolcpus when nohz_full is set · 8cb9764f

由 Chris Metcalf 提交于 5月 06, 2015

nohz_full is only useful with isolcpus are also set, since
otherwise the scheduler has to run periodically to try to
determine whether to steal work from other cores.

Accordingly, when booting with nohz_full=xxx on the command
line, we should act as if isolcpus=xxx was also set, and set
(or extend) the isolcpus set to include the nohz_full cpus.
Signed-off-by: NChris Metcalf <cmetcalf@ezchip.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NRik van Riel <riel@redhat.com>
Acked-by: NMike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-5-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

8cb9764f

context_tracking: Inherit TIF_NOHZ through forks instead of context switches · fafe870f

由 Frederic Weisbecker 提交于 5月 06, 2015

TIF_NOHZ is used by context_tracking to force syscall slow-path
on every task in order to track userspace roundtrips. As such,
it must be set on all running tasks.

It's currently explicitly inherited through context switches.
There is no need to do it in this fast-path though. The flag
could simply be set once for all on all tasks, whether they are
running or not.

Lets do this by setting the flag for the init task on early boot,
and let it propagate through fork inheritance.

While at it, mark context_tracking_cpu_set() as init code, we
only need it at early boot time.
Suggested-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Dave Jones <davej@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430928266-24888-3-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

fafe870f

27 4月, 2015 1 次提交

x86: pvclock: Really remove the sched notifier for cross-cpu migrations · 73459e2a

由 Paolo Bonzini 提交于 4月 23, 2015

This reverts commits 0a4e6be9
and 80f7fdb1.

The task migration notifier was originally introduced in order to support
the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
with synchronized TSC.  Hence, on KVM the race condition is only needed
due to a bad implementation on the host side, and even then it's so rare
that it's mostly theoretical.

As far as KVM is concerned it's possible to fix the host, avoiding the
additional complexity in the vDSO and the (re)introduction of the task
migration notifier.

Xen, on the other hand, hasn't yet implemented vsyscall support at
all, so we do not care about its plans for non-synchronized TSC.
Reported-by: NPeter Zijlstra <peterz@infradead.org>
Suggested-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

73459e2a

03 4月, 2015 1 次提交

sched/core: Drop debugging leftover trace_printk call · 62a935b2

由 Borislav Petkov 提交于 4月 03, 2015

Commit:

  3c18d447 ("sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()")

forgot a trace_printk() debugging piece in and Steve's banner screamed
in dmesg. Remove it.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1428050570-21041-1-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

62a935b2

02 4月, 2015 2 次提交

sched/core: Check for available DL bandwidth in cpuset_cpu_inactive() · 3c18d447

由 Juri Lelli 提交于 3月 31, 2015

Hotplug operations are destructive w.r.t. cpusets. In case such an
operation is performed on a CPU belonging to an exlusive cpuset, the
DL bandwidth information associated with the corresponding root
domain is gone even if the operation fails (in sched_cpu_inactive()).

For this reason we need to move the check we currently have in
sched_cpu_inactive() to cpuset_cpu_inactive() to prevent useless
cpusets reconfiguration in the CPU_DOWN_FAILED path.
Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1427792017-7356-2-git-send-email-juri.lelli@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

3c18d447

sched/core: Remove unused argument from init_[rt|dl]_rq() · 07c54f7a

由 Abel Vesa 提交于 3月 03, 2015

Obviously, 'rq' is not used in these two functions, therefore,
there is no reason for it to be passed as an argument.
Signed-off-by: NAbel Vesa <abelvesa@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

07c54f7a

27 3月, 2015 4 次提交

sched/deadline: Fix rt runtime corruption when dl fails its global constraints · a1963b81

由 Wanpeng Li 提交于 3月 17, 2015

One version of sched_rt_global_constaints() (the !rt-cgroup one)
changes state, therefore if we fail the later sched_dl_global_constraints()
call the state is left in an inconsistent state.

Fix this by changing the order of the calls.
Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
[ Improved the changelog. ]
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NJuri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/1426590931-4639-2-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a1963b81

sched: Add SD_PREFER_SIBLING for SMT level · caff37ef

由 Vincent Guittot 提交于 2月 27, 2015

Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will place at least one task per core.
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-11-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

caff37ef

sched: Remove unused struct sched_group_capacity::capacity_orig · dc7ff76e

由 Vincent Guittot 提交于 3月 03, 2015

The 'struct sched_group_capacity::capacity_orig' field is no longer used
in the scheduler so we can remove it.
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

dc7ff76e

sched: Add struct rq::cpu_capacity_orig · ca6d75e6

由 Vincent Guittot 提交于 2月 27, 2015

This new field 'cpu_capacity_orig' reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ

The cpu_capacity_orig will be used:

  - to detect when the capacity of a CPU has been noticeably reduced so we can
    trig load balance to look for a CPU with better capacity. As an example, we
    can detect when a CPU handles a significant amount of irq
    (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by
    scheduler whereas CPUs, which are really idle, are available.

  - evaluate the available capacity for CFS tasks
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

ca6d75e6

24 3月, 2015 1 次提交

x86: kvm: Revert "remove sched notifier for cross-cpu migrations" · 0a4e6be9

由 Marcelo Tosatti 提交于 3月 23, 2015

The following point:

    2. per-CPU pvclock time info is updated if the
       underlying CPU changes.

Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+

0a4e6be9

23 3月, 2015 1 次提交

sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944

由 Brian Silverman 提交于 2月 18, 2015

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex which induces priority boosting and spins while boosted that gets
killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
does happen eventually with PREEMPT_VOLUNTARY kernels.
Signed-off-by: NBrian Silverman <brian@peloton-tech.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: austin@peloton-tech.com
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

746db944

20 3月, 2015 1 次提交

sched, isolcpu: make cpu_isolated_map visible outside scheduler · 3fa0818b

由 Rik van Riel 提交于 3月 09, 2015

Needed by the next patch. Also makes cpu_isolated_map present
when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
the other cpu masks.

At some point we may want to clean things up so cpumasks do
not exist in UP kernels. Maybe something for the CONFIG_TINY
crowd.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clark Williams <williams@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: cgroups@vger.kernel.org
Signed-off-by: NRik van Riel <riel@redhat.com>
Acked-by: NZefan Li <lizefan@huawei.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

3fa0818b

09 3月, 2015 1 次提交

context_tracking: Rename context symbols to prepare for transition state · c467ea76

由 Frederic Weisbecker 提交于 3月 04, 2015

Current context tracking symbols are designed to express living state.
As such they are prefixed with "IN_": IN_USER, IN_KERNEL.

Now we are going to use these symbols to also express state transitions
such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER).
But while the "IN_" prefix works well to express entering a context, it's
confusing to depict a context exit: context_tracking_exit(IN_USER)
could mean two things:
	1) We are exiting the current context to enter user context.
	2) We are exiting the user context
We want 2) but the reviewer may be confused and understand 1)

So lets disambiguate these symbols and rename them to CONTEXT_USER and
CONTEXT_KERNEL.
Acked-by: NRik van Riel <riel@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will deacon <will.deacon@arm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>

c467ea76

19 2月, 2015 1 次提交

sched/rt/nohz: Stop scheduler tick if running realtime task · 1e78cdbd

由 Rik van Riel 提交于 2月 16, 2015

If the CPU is running a realtime task that does not round-robin
with another realtime task of equal priority, there is no point
in keeping the scheduler tick going. After all, whenever the
scheduler tick runs, the kernel will just decide not to
reschedule.

Extend sched_can_stop_tick() to recognize these situations, and
inform the rest of the kernel that the scheduler tick can be
stopped.
Tested-by: NLuiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: NRik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: fweisbec@redhat.com
Cc: mtosatti@redhat.com
Link: http://lkml.kernel.org/r/20150216152349.6a8ed824@annuminas.surriel.com
[ Small cleanliness tweak. ]
Signed-off-by: NIngo Molnar <mingo@kernel.org>

1e78cdbd

18 2月, 2015 6 次提交

sched/rt: Avoid obvious configuration fail · 2636ed5f

由 Peter Zijlstra 提交于 2月 09, 2015

Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it
would disallow the kernel creating RT tasks.

One can of course still set it to 1, which will (likely) still wreck
your kernel, but at least make it clear that setting it to 0 is not
good.

Collect both sanity checks into the one place while we're there.
Suggested-by: NZefan Li <lizefan@huawei.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

2636ed5f

sched/autogroup: Fix failure to set cpu.rt_runtime_us · 1fe89e1b

由 Peter Zijlstra 提交于 2月 09, 2015

Because task_group() uses a cache of autogroup_task_group(), whose
output depends on sched_class, switching classes can generate
problems.

In particular, when started as fair, the cache points to the
autogroup, so when switching to RT the tg_rt_schedulable() test fails
for every cpu.rt_{runtime,period}_us change because now the autogroup
has tasks and no runtime.

Furthermore, going back to the previous semantics of varying
task_group() with sched_class has the down-side that the sched_debug
output varies as well, even though the task really is in the
autogroup.

Therefore add an autogroup exception to tg_has_rt_tasks() -- such that
both (all) task_group() usages in sched/core now have one. And remove
all the remnants of the variable task_group() output.
Reported-by: NZefan Li <lizefan@huawei.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Fixes: 8323f26c ("sched: Fix race in task_group()")
Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

1fe89e1b

sched: Prevent recursion in io_schedule() · 9cff8ade

由 NeilBrown 提交于 2月 13, 2015

io_schedule() calls blk_flush_plug() which, depending on the
contents of current->plug, can initiate arbitrary blk-io requests.

Note that this contrasts with blk_schedule_flush_plug() which requires
all non-trivial work to be handed off to a separate thread.

This makes it possible for io_schedule() to recurse, and initiating
block requests could possibly call mempool_alloc() which, in times of
memory pressure, uses io_schedule().

Apart from any stack usage issues, io_schedule() will not behave
correctly when called recursively as delayacct_blkio_start() does
not allow for repeated calls.

So:
 - use ->in_iowait to detect recursion.  Set it earlier, and restore
   it to the old value.
 - move the call to "raw_rq" after the call to blk_flush_plug().
   As this is some sort of per-cpu thing, we want some chance that
   we are on the right CPU
 - When io_schedule() is called recurively, use blk_schedule_flush_plug()
   which cannot further recurse.
 - as this makes io_schedule() a lot more complex and as io_schedule()
   must match io_schedule_timeout(), but all the changes in io_schedule_timeout()
   and make io_schedule a simple wrapper for that.
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
[ Moved the now rudimentary io_schedule() into sched.h. ]
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tony Battersby <tonyb@cybernetics.com>
Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brownSigned-off-by: NIngo Molnar <mingo@kernel.org>

9cff8ade

sched: Fix preempt_schedule_common() triggering tracing recursion · 06b1f808

由 Frederic Weisbecker 提交于 2月 16, 2015

Since the function graph tracer needs to disable preemption, it might
call preempt_schedule() after reenabling  it if something triggered the
need for rescheduling in between.

Therefore we can't trace preempt_schedule() itself because we would
face a function tracing recursion otherwise as the tracer is always
called before PREEMPT_ACTIVE gets set to prevent that recursion. This is
why preempt_schedule() is tagged as "notrace".

But the same issue applies to every function called by preempt_schedule()
before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is
one such example. Unfortunately we forgot to tag it as notrace as well
and as a result we are encountering tracing recursion since it got
introduced by:

   a18b5d01 ("sched: Fix missing preemption opportunity")

Let's fix that by applying the appropriate function tag to
preempt_schedule_common().
Reported-by: NHuang Ying <ying.huang@intel.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

06b1f808

sched: Make dl_task_time() use task_rq_lock() · 3960c8c0

由 Peter Zijlstra 提交于 2月 17, 2015

Kirill reported that a dl task can be throttled and dequeued at the
same time. This happens, when it becomes throttled in schedule(),
which is called to go to sleep:

current->state = TASK_INTERRUPTIBLE;
schedule()
    deactivate_task()
        dequeue_task_dl()
            update_curr_dl()
                start_dl_timer()
            __dequeue_task_dl()
    prev->on_rq = 0;

This invalidates the assumption from commit 0f397f2c ("sched/dl:
Fix race in dl_task_timer()"):

  "The only reason we don't strictly need ->pi_lock now is because
   we're guaranteed to have p->state == TASK_RUNNING here and are
   thus free of ttwu races".

And therefore we have to use the full task_rq_lock() here.

This further amends the fact that we forgot to update the rq lock loop
for TASK_ON_RQ_MIGRATE, from commit cca26e80 ("sched: Teach
scheduler to understand TASK_ON_RQ_MIGRATING state").
Reported-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

3960c8c0

sched: Clarify ordering between task_rq_lock() and move_queued_task() · 74b8a4cb

由 Peter Zijlstra 提交于 2月 17, 2015

There was a wee bit of confusion around the exact ordering here;
clarify things.
Reported-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20150217121258.GM5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

74b8a4cb

14 2月, 2015 1 次提交

sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 333470ee

由 Tejun Heo 提交于 2月 13, 2015

printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

333470ee

04 2月, 2015 4 次提交

sched: Pull resched loop to __schedule() callers · bfd9b2b5

由 Frederic Weisbecker 提交于 1月 28, 2015

__schedule() disables preemption during its job and re-enables it
afterward without doing a preemption check to avoid recursion.

But if an event happens after the context switch which requires
rescheduling, we need to check again if a task of a higher priority
needs the CPU. A preempt irq can raise such a situation. To handle that,
__schedule() loops on need_resched().

But preempt_schedule_*() functions, which call __schedule(), also loop
on need_resched() to handle missed preempt irqs. Hence we end up with
the same loop happening twice.

Lets simplify that by attributing the need_resched() loop responsibility
to all __schedule() callers.

There is a risk that the outer loop now handles reschedules that used
to be handled by the inner loop with the added overhead of caller details
(inc/dec of PREEMPT_ACTIVE, irq save/restore) but assuming those inner
rescheduling loop weren't too frequent, this shouldn't matter. Especially
since the whole preemption path is now losing one loop in any case.
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1422404652-29067-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

bfd9b2b5

sched: Fix hrtick_start() on UP · 86893335

由 Wanpeng Li 提交于 11月 26, 2014

The commit 177ef2a6 ("sched/deadline: Fix a precision problem in
the microseconds range") forgot to change the UP version of
hrtick_start(), do so now.
Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
Fixes: 177ef2a6 ("sched/deadline: Fix a precision problem in the microseconds range")
[ Fixed the changelog. ]
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

86893335

sched/deadline: Avoid pointless __setscheduler() · 75381608

由 Wanpeng Li 提交于 11月 26, 2014

There is no need to dequeue/enqueue and push/pull if there are
no scheduling parameters changed for the DL class.

Both fair and RT classes already check if parameters changed for
them to avoid unnecessary overhead. This patch add the parameters
changed test for the DL class in order to reduce overhead.
Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
[ Fixed up the changelog. ]
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

75381608

sched/deadline: Fix deadline parameter modification handling · 40767b0d

由 Peter Zijlstra 提交于 1月 28, 2015

Commit 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to
use in switched_from_dl()") removed the hrtimer_try_cancel() function
call out from init_dl_task_timer(), which gets called from
__setparam_dl().

The result is that we can now re-init the timer while its active --
this is bad and corrupts timer state.

Furthermore; changing the parameters of an active deadline task is
tricky in that you want to maintain guarantees, while immediately
effective change would allow one to circumvent the CBS guarantees --
this too is bad, as one (bad) task should not be able to affect the
others.

Rework things to avoid both problems. We only need to initialize the
timer once, so move that to __sched_fork() for new tasks.

Then make sure __setparam_dl() doesn't affect the current running
state but only updates the parameters used to calculate the next
scheduling period -- this guarantees the CBS functions as expected
(albeit slightly pessimistic).

This however means we need to make sure __dl_clear_params() needs to
reset the active state otherwise new (and tasks flipping between
classes) will not properly (re)compute their first instance.

Todo: close class flipping CBS hole.
Todo: implement delayed BW release.
Reported-by: NLuca Abeni <luca.abeni@unitn.it>
Acked-by: NJuri Lelli <juri.lelli@arm.com>
Tested-by: NLuca Abeni <luca.abeni@unitn.it>
Fixes: 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

40767b0d

02 2月, 2015 1 次提交

sched: don't cause task state changes in nested sleep debugging · 00845eb9

由 Linus Torvalds 提交于 2月 01, 2015

Commit 8eb23b9f ("sched: Debug nested sleeps") added code to report
on nested sleep conditions, which we generally want to avoid because the
inner sleeping operation can re-set the thread state to TASK_RUNNING,
but that will then cause the outer sleep loop not actually sleep when it
calls schedule.

However, that's actually valid traditional behavior, with the inner
sleep being some fairly rare case (like taking a sleeping lock that
normally doesn't actually need to sleep).

And the debug code would actually change the state of the task to
TASK_RUNNING internally, which makes that kind of traditional and
working code not work at all, because now the nested sleep doesn't just
sometimes cause the outer one to not block, but will cause it to happen
every time.

In particular, it will cause the cardbus kernel daemon (pccardd) to
basically busy-loop doing scheduling, converting a laptop into a heater,
as reported by Bruno Prémont.  But there may be other legacy uses of
that nested sleep model in other drivers that are also likely to never
get converted to the new model.

This fixes both cases:

 - don't set TASK_RUNNING when the nested condition happens (note: even
   if WARN_ONCE() only _warns_ once, the return value isn't whether the
   warning happened, but whether the condition for the warning was true.
   So despite the warning only happening once, the "if (WARN_ON(..))"
   would trigger for every nested sleep.

 - in the cases where we knowingly disable the warning by using
   "sched_annotate_sleep()", don't change the task state (that is used
   for all core scheduling decisions), instead use '->task_state_change'
   that is used for the debugging decision itself.

(Credit for the second part of the fix goes to Oleg Nesterov: "Can't we
avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the
suggested change to use 'task_state_change' as part of the test)
Reported-and-bisected-by: NBruno Prémont <bonbons@linux-vserver.org>
Tested-by: NRafael J Wysocki <rjw@rjwysocki.net>
Acked-by: NOleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>,
Cc: Ilya Dryomov <ilya.dryomov@inktank.com>,
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>,
Cc: Davidlohr Bueso <dave@stgolabs.net>,
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

00845eb9

31 1月, 2015 1 次提交

sched: Fix missing preemption opportunity · a18b5d01

由 Frederic Weisbecker 提交于 1月 22, 2015

If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrementation, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption
being delayed because it's interrupting a preempt-disabled area, is
usually fixed up after preemption is re-enabled back with an explicit
call to preempt_schedule().

This is what preempt_enable() does but a raw preempt count decrement as
performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle delayed
preemption check. Therefore when such a race happens, the rescheduling
is going to be delayed until the next scheduler or preemption entrypoint.
This can be a problem for scheduler latency sensitive workloads.

Lets fix that by consolidating cond_resched() with preempt_schedule()
internals.
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reported-by: NIngo Molnar <mingo@kernel.org>
Original-patch-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a18b5d01

28 1月, 2015 1 次提交

sched: Fix crash if cpuset_cpumask_can_shrink() is passed an empty cpumask · bb2bc55a

由 Mike Galbraith 提交于 1月 28, 2015

While creating an exclusive cpuset, we passed cpuset_cpumask_can_shrink()
an empty cpumask (cur), and dl_bw_of(cpumask_any(cur)) made boom with it:

 CPU: 0 PID: 6942 Comm: shield.sh Not tainted 3.19.0-master #19
 Hardware name: MEDIONPC MS-7502/MS-7502, BIOS 6.00 PG 12/26/2007
 task: ffff880224552450 ti: ffff8800caab8000 task.ti: ffff8800caab8000
 RIP: 0010:[<ffffffff81073846>]  [<ffffffff81073846>] cpuset_cpumask_can_shrink+0x56/0xb0
 [...]
 Call Trace:
  [<ffffffff810cb82a>] validate_change+0x18a/0x200
  [<ffffffff810cc877>] cpuset_write_resmask+0x3b7/0x720
  [<ffffffff810c4d58>] cgroup_file_write+0x38/0x100
  [<ffffffff811d953a>] kernfs_fop_write+0x12a/0x180
  [<ffffffff8116e1a3>] vfs_write+0xb3/0x1d0
  [<ffffffff8116ed06>] SyS_write+0x46/0xb0
  [<ffffffff8159ced6>] system_call_fastpath+0x16/0x1b
Signed-off-by: NMike Galbraith <umgwanakikbuti@gmail.com>
Acked-by: NZefan Li <lizefan@huawei.com>
Fixes: f82f8042 ("sched/deadline: Ensure that updates to exclusive cpusets don't break AC")
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1422417235.5716.5.camel@marge.simpson.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

bb2bc55a

14 1月, 2015 6 次提交

perf: Avoid horrible stack usage · 86038c5e

由 Peter Zijlstra (Intel) 提交于 12月 16, 2014

Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.

The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.

Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.

This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.

Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.

We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to assertain which __perf_regs
element to use).

One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
Reported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reported-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Javi Merino <javi.merino@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.cz>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Vaibhav Nagarnaik <vnagarnaik@google.com>
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

86038c5e

sched/core: Rework rq->clock update skips · 9edfbfed

由 Peter Zijlstra 提交于 1月 05, 2015

The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscrimiate clock skips.

Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)

By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

9edfbfed

sched/core: Remove check of p->sched_class · 1b537c7d

由 Yao Dongdong 提交于 12月 29, 2014

Search all usage of p->sched_class in sched/core.c, no one check it
before use, so it seems that every task must belong to one sched_class.
Signed-off-by: NYao Dongdong <yaodongdong@huawei.com>
[ Moved the early class assignment to make it boot. ]
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1419835303-28958-1-git-send-email-yaodongdong@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1b537c7d

sched/fair: Fix sched_entity::avg::decay_count initialization · bb04159d

由 Kirill Tkhai 提交于 12月 15, 2014

Child has the same decay_count as parent. If it's not zero,
we add it to parent's cfs_rq->removed_load:

wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair().

Child's load is a just garbade after copying of parent,
it hasn't been on cfs_rq yet, and it must not be added to
cfs_rq::removed_load in migrate_task_rq_fair().

The patch moves sched_entity::avg::decay_count intialization
in sched_fork(). So, migrate_task_rq_fair() does not change
removed_load.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NBen Segall <bsegall@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

bb04159d

sched/debug: Fix potential call to __ffs(0) in sched_show_task() · 1f8a7633

由 Tetsuo Handa 提交于 12月 05, 2014

"struct task_struct"->state is "volatile long" and __ffs() warns that
"Undefined if no bit exists, so code should check against 0 first."

Therefore, at expression

state = p->state ? __ffs(p->state) + 1 : 0;

in sched_show_task(), CPU might see "p->state" before "?" as "non-zero"
but "p->state" after "?" as "zero", which could result in
"state >= sizeof(stat_nam)" being true and bogus '?' is printed.

This patch changes "state" from "unsigned int" to "unsigned long" and
save "p->state" before calling __ffs(), in order to avoid potential call
to __ffs(0).
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/201412052131.GCE35924.FVHFOtLOJOMQFS@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>

1f8a7633

sched/debug: Check for stack overflow in ___might_sleep() · a8b686b3

由 Eric Sandeen 提交于 12月 16, 2014

Sometimes a "BUG: sleeping function called from invalid context"
message is not indicative of locking problems, but is the result
of a stack overflow corrupting the thread info.

Witness http://oss.sgi.com/archives/xfs/2014-02/msg00325.html
for example, which took a few go-rounds to sort out.

If we're printing the warning, things are wonky already, and
it'd be informative to check for the stack end corruption at this
point, too.
Signed-off-by: NEric Sandeen <sandeen@redhat.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/5490B158.4060005@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

a8b686b3

23 12月, 2014 1 次提交

sched: Fix KMALLOC_MAX_SIZE overflow during cpumask allocation · b74e6278

由 Alex Thorlton 提交于 12月 18, 2014

When allocating space for load_balance_mask, in sched_init, when
CPUMASK_OFFSTACK is set, we've managed to spill over
KMALLOC_MAX_SIZE on our 6144 core machine.  The patch below
breaks up the allocations so that they don't overflow the max
alloc size.  It also allocates the masks on the the node from
which they'll most commonly be accessed, to minimize remote
accesses on NUMA machines.
Suggested-by: NGeorge Beshers <gbeshers@sgi.com>
Signed-off-by: NAlex Thorlton <athorlton@sgi.com>
Cc: George Beshers <gbeshers@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

b74e6278

11 12月, 2014 1 次提交

sched_show_task: fix unsafe usage of ->real_parent · a90e984c

由 Oleg Nesterov 提交于 12月 10, 2014

rcu_read_lock() can not protect p->real_parent if release_task(p) was
already called, change sched_show_task() to check pis_alive() like other
users do.

Note: we need some helpers to cleanup the code like this.  And it seems
that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a90e984c

08 12月, 2014 1 次提交

sched: Add missing rcu protection to wake_up_all_idle_cpus · fd7de1e8

由 Andy Lutomirski 提交于 11月 29, 2014

Locklessly doing is_idle_task(rq->curr) is only okay because of
RCU protection. The older variant of the broken code checked
rq->curr == rq->idle instead and therefore didn't need RCU.

Fixes: f6be8af1 ("sched: Add new API wake_up_if_idle() to wake up the idle cpu")
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Reviewed-by: NChuansheng Liu <chuansheng.liu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

fd7de1e8

04 12月, 2014 1 次提交

context_tracking: Restore previous state in schedule_user · 7cc78f8f

由 Andy Lutomirski 提交于 12月 03, 2014

It appears that some SCHEDULE_USER (asm for schedule_user) callers
in arch/x86/kernel/entry_64.S are called from RCU kernel context,
and schedule_user will return in RCU user context.  This causes RCU
warnings and possible failures.

This is intended to be a minimal fix suitable for 3.18.
Reported-and-tested-by: NDave Jones <davej@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7cc78f8f

16 11月, 2014 1 次提交

sched: Move p->nr_cpus_allowed check to select_task_rq() · 6c1d9410

由 Wanpeng Li 提交于 11月 05, 2014

Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
This change will make fair.c, rt.c, and deadline.c all start with the
same logic.
Suggested-and-Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: "pang.xunlei" <pang.xunlei@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

6c1d9410