Commits · 143e1e28cb40bed836b0a06567208bd7347c9672 · openanolis / cloud-kernel

07 May, 2014 11 commits

sched: Rework sched_domain topology definition · 143e1e28

Authored by Vincent Guittot on Apr 11, 2014

We replace the old way to configure the scheduler topology with a new method
which enables a platform to declare additional levels (if needed).

We still have a default topology table definition that can be used by platforms
that don't want more levels than the SMT, MC, CPU and NUMA ones. This table can
be overwritten by an arch which either wants to add a new level where load
balancing makes sense, like a BOOK or powergating level, or wants to change the
flags configuration of some levels.

For each level, we need a function pointer that returns the cpumask for each cpu,
a function pointer that returns the flags for the level, and a name. Only flags
that describe topology can be set by an architecture. The current topology
flags are:

 SD_SHARE_CPUPOWER
 SD_SHARE_PKG_RESOURCES
 SD_NUMA
 SD_ASYM_PACKING

Then, each level must be a subset of the next one. The build sequence of the
sched_domain will take care of removing useless levels like those with 1 CPU
and those with the same CPU span and no more relevant information for
load balancing than its children.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <hanjun.guo@linaro.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux390@de.ibm.com
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

143e1e28
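
The per-level table described in commit 143e1e28 above (a cpumask callback, a flags callback and a name for each level) can be pictured with a small user-space sketch. Everything suffixed `_sketch`, the flag bit values, the 4-CPU layout and the flags chosen for each level are assumptions made for illustration; this is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

/* Topology flags listed in the commit message. The bit values below are
 * placeholders chosen for this sketch, not the kernel's values. */
enum {
	SD_SHARE_CPUPOWER      = 1 << 0,
	SD_SHARE_PKG_RESOURCES = 1 << 1,
	SD_NUMA                = 1 << 2,
	SD_ASYM_PACKING        = 1 << 3,
};

typedef unsigned long cpumask_sketch_t;	/* hypothetical: one bit per CPU */

/* One topology level: a span callback, a flags callback and a name. */
struct topo_level_sketch {
	cpumask_sketch_t (*mask)(int cpu);
	int (*flags)(void);
	const char *name;
};

/* Hypothetical machine: 4 CPUs, 2 cores with 2 SMT threads each. */
static cpumask_sketch_t smt_mask(int cpu)
{
	assert(cpu >= 0 && cpu < 4);
	return 0x3UL << (cpu & ~1);	/* the two sibling threads */
}

static cpumask_sketch_t all_mask(int cpu)
{
	assert(cpu >= 0 && cpu < 4);
	return 0xfUL;			/* every CPU in the package */
}

static int smt_flags(void) { return SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES; }
static int mc_flags(void)  { return SD_SHARE_PKG_RESOURCES; }
static int no_flags(void)  { return 0; }

/* Smallest level first; a NULL mask terminates the table. */
static const struct topo_level_sketch topology_sketch[] = {
	{ smt_mask, smt_flags, "SMT" },
	{ all_mask, mc_flags,  "MC"  },
	{ all_mask, no_flags,  "CPU" },
	{ NULL, NULL, NULL },
};

/* Returns 1 if, for this CPU, every level's span is a subset of the next. */
static int levels_nested(const struct topo_level_sketch *tl, int cpu)
{
	for (; tl[0].mask && tl[1].mask; tl++) {
		cpumask_sketch_t inner = tl[0].mask(cpu);
		cpumask_sketch_t outer = tl[1].mask(cpu);

		if (inner & ~outer)
			return 0;
	}
	return 1;
}
```

In this toy layout the MC and CPU levels span the same CPUs, which is the kind of redundant level the commit message says the build sequence removes.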

sched/numa: Do not set preferred_node on migration to a second choice node · 68d1b02a

Authored by Rik van Riel on Apr 11, 2014

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-4-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

68d1b02a

sched/numa: Retry placement more frequently when misplaced · 5085e2a3

Authored by Rik van Riel on Apr 11, 2014

When tasks have not converged on their preferred nodes yet, we want
to retry fairly often, to make sure we do not migrate a task's memory
to an undesirable location, only to have to move it again later.

This patch reduces the interval at which migration is retried,
when the task's numa_scan_period is small.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-3-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5085e2a3

sched/numa: Count pages on active node as local · 792568ec

Authored by Rik van Riel on Apr 11, 2014

The NUMA code is smart enough to distribute the memory of workloads
that span multiple NUMA nodes across those NUMA nodes.

However, it still has a pretty high scan rate for such workloads,
because any memory that is left on a node other than the node of
the CPU that faulted on the memory is counted as non-local, which
causes the scan rate to go up.

Counting the memory on any node where the task's numa group is
actively running as local, allows the scan rate to slow down
once the application is settled in.

This should reduce the overhead of the automatic NUMA placement
code, when a workload spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Vinod Chegu <chegu_vinod@hp.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1397235629-16328-2-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

792568ec

sched/numa: Initialize newidle balance stats in sd_numa_init() · 2b4cfe64

Authored by Jason Low on Apr 23, 2014

Also initialize the per-sd variables for newidle load balancing
in sd_numa_init().

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: morten.rasmussen@arm.com
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Cc: aswin@hp.com
Cc: chegu_vinod@hp.com
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1398303035-18255-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2b4cfe64

sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance() · 0e5b5337

Authored by Jason Low on Apr 28, 2014

The following commit:

  e5fc6611 ("sched: Fix race in idle_balance()")

can potentially cause rq->max_idle_balance_cost to not be updated,
even when load_balance(NEWLY_IDLE) is attempted and the per-sd
max cost value is updated.

Preeti noticed a similar issue with updating rq->next_balance.

In this patch, we fix this by making sure we still check/update those values
even if a task gets enqueued while browsing the domains.

Signed-off-by: Jason Low <jason.low2@hp.com>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: morten.rasmussen@arm.com
Cc: aswin@hp.com
Cc: daniel.lezcano@linaro.org
Cc: alex.shi@linaro.org
Cc: efault@gmx.de
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1398725155-7591-2-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

0e5b5337

sched: Skip double execution of pick_next_task_fair() · 6ccdc84b

Authored by Peter Zijlstra on Apr 24, 2014

Tim wrote:

 "The current code will call pick_next_task_fair a second time in the
  slow path if we did not pull any task in our first try.  This is
  really unnecessary as we already know no task can be pulled and it
  doubles the delay for the cpu to enter idle.

  We instrumented some network workloads and saw that
  pick_next_task_fair is frequently called twice before a cpu enters
  idle.  The call to pick_next_task_fair can add non-trivial latency as
  it calls load_balance which runs find_busiest_group on a hierarchy of
  sched domains spanning the cpus for a large system.  For some 4 socket
  systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
  before a cpu can be idled."

Optimize the second call away for the common case and document the
dependency.

Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

6ccdc84b

sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check · 6227cb00

Authored by Steven Rostedt (Red Hat) on Apr 13, 2014

The check at the beginning of cpupri_find() makes sure that the task_pri
variable does not exceed the cp->pri_to_cpu array length. But that length
is CPUPRI_NR_PRIORITIES, not MAX_RT_PRIO, so the check misses the last two
priorities in that array.

As task_pri is computed from convert_prio(), which should never be bigger
than CPUPRI_NR_PRIORITIES, the check should cause a panic if it is
hit.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

6227cb00
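
The off-by-two bound in commit 6227cb00 above can be reproduced with a short user-space sketch. The values `MAX_RT_PRIO = 100` and `CPUPRI_NR_PRIORITIES = MAX_RT_PRIO + 2` are assumptions consistent with the "last two priorities" wording; the helper names are hypothetical, and `assert()` stands in for the kernel panic the message asks for.

```c
#include <assert.h>

#define MAX_RT_PRIO          100                /* assumed value */
#define CPUPRI_NR_PRIORITIES (MAX_RT_PRIO + 2)  /* assumed: two extra slots */

static int pri_to_cpu[CPUPRI_NR_PRIORITIES];    /* stand-in for cp->pri_to_cpu */

/* Old bound: rejects indices that are valid for the array. */
static int old_bound_ok(int task_pri) { return task_pri < MAX_RT_PRIO; }

/* New bound: matches the real array length. */
static int new_bound_ok(int task_pri) { return task_pri < CPUPRI_NR_PRIORITIES; }

/* Counts the valid array slots that the old bound refuses to look at. */
static int missed_slots(void)
{
	int missed = 0;

	for (int pri = 0; pri < CPUPRI_NR_PRIORITIES; pri++)
		if (!old_bound_ok(pri))
			missed++;
	return missed;
}

/* An out-of-range index is a bug, so fail loudly instead of returning. */
static int *slot_for(int task_pri)
{
	assert(task_pri >= 0 && new_bound_ok(task_pri));
	return &pri_to_cpu[task_pri];
}
```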

sched/deadline: Fix memory leak · 6a7cd273

Authored by Li Zefan on Apr 17, 2014

Free cpudl->free_cpus allocated in cpudl_init().

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org> # 3.14+
Link: http://lkml.kernel.org/r/534F36CE.2000409@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

6a7cd273

sched/deadline: Fix sched_yield() behavior · 5bfd126e

Authored by Juri Lelli on Apr 15, 2014

yield_task_dl() is broken:

 o it forces current to be throttled setting its runtime to zero;
 o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
   will queue it back with proper parameters at replenish time.

Unfortunately, dl_task_timer() has this check at the very beginning:

	if (!dl_task(p) || dl_se->dl_new)
		goto unlock;

So, it just bails out and the task is never replenished. It actually
yielded forever.

To fix this, introduce a new flag indicating that the task properly yielded
the CPU before its current runtime expired. While this is somewhat overkill
at the moment, the flag would be useful in the future to discriminate between
"good" jobs (whose remaining runtime could be reclaimed, i.e. recycled)
and "bad" jobs (for which dl_throttled has been set) that needed to be
stopped.

Reported-by: yjay.kim <yjay.kim@lge.com>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5bfd126e
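
Why the old yield path in commit 5bfd126e above never got replenished can be seen in a toy model. This is only a sketch: the struct, the field name `yielded` and the helper names are hypothetical, and the timer is reduced to the one check quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dl_entity_sketch {
	bool dl_new;	/* flag named in the commit message */
	bool yielded;	/* hypothetical name for the new flag */
	long runtime;	/* remaining runtime budget */
};

/* Models the quoted check: bail out when the entity is marked new.
 * Returns true if the entity was replenished. */
static bool timer_fires(struct dl_entity_sketch *se, long full_runtime)
{
	assert(se != NULL);
	if (se->dl_new)
		return false;	/* the "goto unlock" path */
	se->runtime = full_runtime;
	se->yielded = false;
	return true;
}

/* Old behaviour: throttle by zeroing runtime and mark the entity new. */
static void yield_old(struct dl_entity_sketch *se)
{
	se->runtime = 0;
	se->dl_new = true;
}

/* Fixed behaviour: record the yield in a separate flag instead. */
static void yield_new(struct dl_entity_sketch *se)
{
	se->runtime = 0;
	se->yielded = true;
}
```

With `yield_old()` the timer always takes the early exit, so the runtime stays at zero; with `yield_new()` the same timer refills it.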

sched: Sanitize irq accounting madness · 2d513868

Authored by Thomas Gleixner on May 02, 2014

Russell reported that irqtime_account_idle_ticks() takes ages due to:

       for (i = 0; i < ticks; i++)
               irqtime_account_process_tick(current, 0, rq);

It's sad that this code was written way _AFTER_ the NOHZ idle
functionality was available. I charge myself guilty for not paying
attention when that crap got merged with commit abb74cef ("sched:
Export ns irqtimes through /proc/stat").

So instead of looping nr_ticks times just apply the whole thing at
once.

As a side note: The whole cputime_t vs. u64 business in that context
wants to be cleaned up as well. There is no point in having all these
back and forth conversions. Let's standardise on u64 nsec for all
kernel internal accounting and be done with it. Everything else does
not make sense at all for fine grained accounting. Frederic, can you
please take care of that?

Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Shaun Ruffell <sruffell@digium.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>

2d513868
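
The idea in commit 2d513868 above, replacing a loop over ticks with one batched update, can be sketched in user space as follows. The function names and the 10 ms tick length are assumptions for this example only; the real accounting function does more than add a constant.

```c
#include <assert.h>
#include <stdint.h>

#define TICK_NSEC_SKETCH 10000000ULL	/* assumed 10 ms tick (HZ=100) */

/* Accounts a single idle tick. */
static uint64_t account_one_tick(uint64_t idle_ns)
{
	return idle_ns + TICK_NSEC_SKETCH;
}

/* Old shape: cost grows linearly with the number of idle ticks. */
static uint64_t account_idle_loop(uint64_t idle_ns, unsigned long ticks)
{
	for (unsigned long i = 0; i < ticks; i++)
		idle_ns = account_one_tick(idle_ns);
	return idle_ns;
}

/* New shape: apply all ticks at once, in constant time. */
static uint64_t account_idle_batch(uint64_t idle_ns, unsigned long ticks)
{
	assert(ticks <= UINT64_MAX / TICK_NSEC_SKETCH);	/* no overflow */
	return idle_ns + (uint64_t)ticks * TICK_NSEC_SKETCH;
}
```

Both variants produce the same total; only the second stays cheap after a long NOHZ idle period with many pending ticks.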

24 Apr, 2014 1 commit

sched/docbook: Fix 'make htmldocs' warnings caused by missing description · db66d756

Authored by Masanari Iida on Apr 18, 2014

When the 'flags' argument to the sched_{set,get}attr() syscalls was
added in:

6d35ab48 ("sched: Add 'flags' argument to sched_{set,get}attr() syscalls")

no description for 'flags' was added. It causes the following warnings on "make htmldocs":

Warning(/kernel/sched/core.c:3645): No description found for parameter 'flags'
Warning(/kernel/sched/core.c:3789): No description found for parameter 'flags'

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1397753955-2914-1-git-send-email-standby24x7@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

db66d756

18 Apr, 2014 6 commits

sched: Revert commit ("sched/core: Fix endless loop in pick_next_task()") · 46383648

Authored by Kirill Tkhai on Mar 15, 2014

This reverts commit 4c6c4e38 ("sched/core: Fix endless loop in
pick_next_task()"), which is not necessary after ("sched/rt: Substract number
of tasks of throttled queues from rq->nr_running").

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
[conflict resolution with stop task checking patch]
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394835307.18748.34.camel@HP-250-G1-Notebook-PC
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

46383648

sched/rt: Substract number of tasks of throttled queues from rq->nr_running · f4ebcbc0

Authored by Kirill Tkhai on Mar 15, 2014

rq->rt can now be in either a dequeued or an enqueued state.
We add a new member, rt_rq->rt_queued, which is used to indicate this.
The member is used only for the top queue rq->rt_rq.

The goal is to fit the generic scheme which is used in the deadline and
fair classes, i.e. a throttled rt_rq's rt_nr_running is
subtracted from rq->nr_running.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394835300.18748.33.camel@HP-250-G1-Notebook-PC
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

f4ebcbc0

sched/rt: Add accessors rq_of_rt_se() · 653d07a6

Authored by Kirill Tkhai on Mar 15, 2014

Two accessors for RT_GROUP_SCHED and !RT_GROUP_SCHED cases.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394835295.18748.32.camel@HP-250-G1-Notebook-PC
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

653d07a6

sched/rt: Sum number of all children tasks in hierarhy at ->rt_nr_running · 22abdef3

Authored by Kirill Tkhai on Mar 15, 2014

{inc,dec}_rt_tasks() used to count entities which are directly queued
on the rt_rq. If an entity was not a task (i.e., it is some queue), its
children were not counted.

There is no problem here, but now we want to count the number of all tasks
which are actually queued under the rt_rq in all the hierarchy (except
throttled rt queues).

Empty queues cannot be queued, and all of the places which
use ->rt_nr_running just compare it with zero, so we do not break
anything here.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394835289.18748.31.camel@HP-250-G1-Notebook-PC
Cc: linux-kernel@vger.kernel.org
[ Twiddled the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

22abdef3
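
The counting change in commit 22abdef3 above (a group entity now contributes all tasks queued below it, not just one) can be sketched like this. The struct layouts and helper names are hypothetical simplifications, and throttling is left out.

```c
#include <assert.h>
#include <stddef.h>

struct rt_rq_sketch {
	unsigned int rt_nr_running;	/* tasks queued below this rt_rq */
};

struct rt_entity_sketch {
	struct rt_rq_sketch *my_q;	/* NULL when the entity is a task */
};

/* A task counts as one; a group entity counts as all tasks below it. */
static unsigned int entity_nr_running(const struct rt_entity_sketch *se)
{
	return se->my_q ? se->my_q->rt_nr_running : 1;
}

/* Enqueue accounting on the parent queue. */
static void inc_tasks_sketch(struct rt_rq_sketch *rq,
			     const struct rt_entity_sketch *se)
{
	rq->rt_nr_running += entity_nr_running(se);
}

/* Dequeue accounting on the parent queue. */
static void dec_tasks_sketch(struct rt_rq_sketch *rq,
			     const struct rt_entity_sketch *se)
{
	unsigned int n = entity_nr_running(se);

	assert(rq->rt_nr_running >= n);
	rq->rt_nr_running -= n;
}
```

Because callers only compare the counter with zero, switching from "number of entities" to "number of tasks" keeps their behaviour unchanged, as the commit message argues.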

sched/rt: Do not try to push tasks if pinned task switches to RT · 10447917

Authored by Kirill V Tkhai on Mar 12, 2014

A pinned task that has just switched to RT cannot be pushed. If the rq
already had several RT tasks, they have already been considered as
candidates to be pushed (or pulled).

Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140312061833.3a43aa64@gandalf.local.home
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

10447917

sched: Make scale_rt_power() deal with backward clocks · cadefd3d

Authored by Peter Zijlstra on Feb 27, 2014

Mike reported that, while unlikely, it's entirely possible for
scale_rt_power() to see the time go backwards. This yields rather
'interesting' results.

So, like all other sites that deal with clocks, make this one ignore
backward clock movement too.

Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140227094035.GZ9987@twins.programming.kicks-ass.net
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

cadefd3d
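
Ignoring backward clock movement, as commit cadefd3d above describes, usually comes down to clamping a signed delta at zero. The helper below is a minimal sketch with a hypothetical name; it is not the body of scale_rt_power().

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed time since `stamp`, treating a clock that went backwards as
 * "no time has passed". The unsigned difference is reinterpreted as
 * signed, which assumes the usual two's complement representation. */
static uint64_t elapsed_ns_sketch(uint64_t now, uint64_t stamp)
{
	int64_t delta = (int64_t)(now - stamp);

	if (delta < 0)
		delta = 0;
	assert(delta >= 0);
	return (uint64_t)delta;
}
```

Without the clamp, `now - stamp` with `now < stamp` would wrap to a huge unsigned value, which is one way to get the 'interesting' results mentioned in the message.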

17 Apr, 2014 1 commit

sched: Check for stop task appearance when balancing happens · a1d9a323

Authored by Kirill Tkhai on Apr 10, 2014

We need to do it like we do for the other higher priority classes.

Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/336561397137116@web27h.yandex.ru
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a1d9a323

11 Apr, 2014 1 commit

sched/numa: Fix task_numa_free() lockdep splat · 60e69eed

Authored by Mike Galbraith on Apr 07, 2014

Sasha reported that lockdep claims that the following commit:

  156654f4 ("sched/numa: Move task_numa_free() to __put_task_struct()")

made numa_group.lock interrupt unsafe.

While I don't see how that could be, given the commit in question moved
task_numa_free() from one irq enabled region to another, the below does
make both gripes and lockups upon gripe with numa=fake=4 go away.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Fixes: 156654f4 ("sched/numa: Move task_numa_free() to __put_task_struct()")
Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org
Cc: mgorman@suse.com
Cc: akpm@linux-foundation.org
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

60e69eed

08 Apr, 2014 2 commits

kernel: use macros from compiler.h instead of __attribute__((...)) · 52f5684c

Authored by Gideon Israel Dsouza on Apr 07, 2014

To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs.  Eg: __weak for
__attribute__((weak)).  I've replaced all instances of gcc attributes
with the right macro in the kernel subsystem.

Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

52f5684c
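
The kind of substitution commit 52f5684c above describes, `__weak` in place of `__attribute__((weak))`, looks roughly like this outside the kernel. The macro definition here is written out by hand as an assumption about what the convenience macro expands to, the function names are hypothetical, and the example needs a GCC-compatible compiler.

```c
#include <assert.h>

/* Hand-written stand-in for the convenience macro; assumed expansion. */
#ifndef __weak
#define __weak __attribute__((weak))
#endif

/* Default implementation. A non-weak definition of the same symbol in
 * another object file would override it at link time. */
__weak int arch_hook_sketch(void)
{
	return 0;
}

/* Callers use the symbol normally; the linker picks the definition. */
int call_hook_sketch(void)
{
	return arch_hook_sketch();
}
```

Keeping the attribute behind a macro means a compiler with different syntax only needs a different macro definition, which is the portability argument made in the message.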

sched: remove sleep_on() and friends · b8780c36

Authored by Arnd Bergmann on Apr 07, 2014

This is the final piece in the puzzle, as all patches to remove the
last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
into the 3.15 merge window. The work was long overdue, and this
interface in particular should not have survived the BKL removal
that was done a couple of years ago.

Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3:

 "[...] it was suggested that the janitors look for and fix all code
  that calls sleep_on() [...] since (1) almost all such code is
  incorrect, and (2) Linus has agreed that those functions should
  be removed in the 2.5 development series".

We haven't quite made it for 2.5, but maybe we can merge this for 3.15.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b8780c36

04 Apr, 2014 1 commit

kernel: audit/fix non-modular users of module_init in core code · c96d6660

Authored by Paul Gortmaker on Apr 03, 2014

Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular.  So using module_init as an
alias for __initcall can be somewhat misleading.

Fix these up now, so that we can relocate module_init from init.h into
module.h in the future.  If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.

The audit targets the following module_init users for change:
 kernel/user.c                  obj-y
 kernel/kexec.c                 bool KEXEC (one instance per arch)
 kernel/profile.c               bool PROFILING
 kernel/hung_task.c             bool DETECT_HUNG_TASK
 kernel/sched/stats.c           bool SCHEDSTATS
 kernel/user_namespace.c        bool USER_NS

Note that direct use of __initcall is discouraged, vs.  one of the
priority categorized subgroups.  As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e.  slightly earlier).  However no observable impact of that
difference has been observed during testing.

Also, two instances of missing ";" at EOL are fixed in kexec.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c96d6660

20 Mar, 2014 1 commit

timer: Remove code redundancy while calling get_nohz_timer_target() · 6201b4d6

Authored by Viresh Kumar on Mar 18, 2014

There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
call it under same circumstances, i.e.

	#ifdef CONFIG_NO_HZ_COMMON
	       if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
	               return get_nohz_timer_target();
	#endif

So, it makes more sense to get all this as part of get_nohz_timer_target()
instead of duplicating code at two places. For this another parameter is
required to be passed to this routine, pinned.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

6201b4d6
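
Folding the duplicated caller-side condition from commit 6201b4d6 above into the callee, with `pinned` as the extra parameter, could look like the sketch below. The state struct and all names are hypothetical stand-ins for the kernel helpers quoted in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the state the real helpers would query. */
struct cpu_state_sketch {
	int this_cpu;
	bool idle;		/* stand-in for idle_cpu(this_cpu) */
	bool migration_enabled;	/* stand-in for the timer migration sysctl */
	int busy_cpu;		/* target that would be chosen otherwise */
};

/* The callee takes `pinned` and decides on its own whether to migrate,
 * so the two callers no longer repeat the same condition. */
static int nohz_timer_target_sketch(const struct cpu_state_sketch *s,
				    bool pinned)
{
	assert(s != NULL);
	if (pinned || !s->migration_enabled || !s->idle)
		return s->this_cpu;
	return s->busy_cpu;
}
```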

13 Mar, 2014 2 commits

sched: Remove needless round trip nsecs <-> tick conversion of steal time · 300a9d88

Authored by Frederic Weisbecker on Mar 05, 2014

When update_rq_clock_task() accounts the pending steal time for a task,
it converts the steal delta from nsecs to tick then from tick to nsecs.

There is no apparent good reason for doing that though because both
the task clock and the prev steal delta are u64 and store values
in nsecs.

So let's remove the needless conversion.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

300a9d88

cputime: Fix jiffies based cputime assumption on steal accounting · dee08a72

Authored by Frederic Weisbecker on Mar 05, 2014

The steal guest time accounting code assumes that cputime_t is based on
jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time() which then accounts it as if it's a
value in nsecs.

As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs.

As such /proc/stat may report 0 values of steal time even when two
guests have run concurrently for a few seconds on the same host and
same CPU.

In order to fix this, let's convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.

Given that the steal time is stored in cputime_t and this type can have
a smaller granularity than nsecs, we only account the rounded converted
value and leave the remaining nsecs for the next deltas.

Reported-by: Huiqingding <huding@redhat.com>
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>

dee08a72
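
The arithmetic in commit dee08a72 above can be worked through in a small sketch: first the unit mix-up (1 second of steal at HZ=100 turning into "100 nsecs"), then the fix of accounting only whole cputime units and carrying the remainder forward. All names are hypothetical and `gran_ns` is an assumed cputime granularity.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC_SKETCH 1000000000ULL

/* The bug: a delta converted to jiffies is later read as nanoseconds. */
static uint64_t misread_as_ns(uint64_t delta_ns, unsigned int hz)
{
	assert(hz > 0 && hz <= NSEC_PER_SEC_SKETCH);
	return delta_ns / (NSEC_PER_SEC_SKETCH / hz);	/* really jiffies */
}

struct steal_acct_sketch {
	uint64_t pending_ns;		/* not yet accounted remainder */
	uint64_t accounted_units;	/* whole cputime units accounted */
};

/* The fix, in spirit: account only whole units of `gran_ns` and leave
 * the remaining nanoseconds for the next delta. */
static void account_steal_sketch(struct steal_acct_sketch *a,
				 uint64_t delta_ns, uint64_t gran_ns)
{
	uint64_t units;

	assert(gran_ns > 0);
	a->pending_ns += delta_ns;
	units = a->pending_ns / gran_ns;
	a->accounted_units += units;
	a->pending_ns -= units * gran_ns;
}
```

For example, with a 10 ms granularity a 25 ms delta accounts two units and keeps 5 ms pending; a following 5 ms delta completes the third unit.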

12 Mar, 2014 3 commits

sched: Clean up the task_hot() function · 6037dd1a

Authored by Alex Shi on Mar 12, 2014

task_hot() doesn't need the 'sched_domain' parameter, so remove it.

Signed-off-by: Alex Shi <alex.shi@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394607111-1904-1-git-send-email-alex.shi@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

6037dd1a

sched: Remove double calculation in fix_small_imbalance() · a2cd4260

Authored by Vincent Guittot on Mar 11, 2014

The tmp value has been already calculated in:

  scaled_busy_load_per_task =
		(busiest->load_per_task * SCHED_POWER_SCALE) /
		busiest->group_power;

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394555166-22894-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

a2cd4260

sched: Fix broken setscheduler() · 383afd09

Authored by Steven Rostedt on Mar 11, 2014

I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:

   linux-next commit c365c292
   "sched: Consider pi boosting in setscheduler()"

And the reason the wakeup_rt tracer test was failing was that it had
no RT task to trace. I first noticed this when running with the
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:

 -       p->normal_prio = normal_prio(p);
 -       p->prio = rt_mutex_getprio(p);

With no

 +       p->normal_prio = normal_prio(p);
 +       p->prio = rt_mutex_getprio(p);

Reading what the commit is supposed to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless, otherwise, when the
task is deboosted, it won't get the new priority.

The p->prio has to be set before "check_class_changed()" is called,
otherwise the class won't be changed.

Also added fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>

383afd09
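
The rule stated in commit 383afd09 above, that normal_prio is always updated while prio is left alone when a higher boost is in effect, can be sketched as follows. This is a toy model: the struct, the `boosted_prio` field and the helper name are hypothetical, and it assumes the kernel convention that a lower number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

struct task_sketch {
	int normal_prio;	/* priority without any boosting */
	int prio;		/* effective priority */
	int boosted_prio;	/* hypothetical: -1 when not boosted */
};

/* Lower value means higher priority in this sketch. */
static void set_prio_sketch(struct task_sketch *p, int new_normal)
{
	assert(p != NULL);
	p->normal_prio = new_normal;	/* set regardless of boosting */

	if (p->boosted_prio >= 0 && p->boosted_prio < new_normal)
		p->prio = p->boosted_prio;	/* keep the higher boost */
	else
		p->prio = new_normal;
}
```

When the boost later goes away, the task can fall back to `normal_prio`, which already holds the new value.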

11 Mar, 2014 11 commits

sched/numa: Move task_numa_free() to __put_task_struct() · 156654f4

Authored by Mike Galbraith on Feb 28, 2014

Bad idea on -rt:

[  908.026136]  [<ffffffff8150ad6a>] rt_spin_lock_slowlock+0xaa/0x2c0
[  908.026145]  [<ffffffff8108f701>] task_numa_free+0x31/0x130
[  908.026151]  [<ffffffff8108121e>] finish_task_switch+0xce/0x100
[  908.026156]  [<ffffffff81509c0a>] thread_return+0x48/0x4ae
[  908.026160]  [<ffffffff8150a095>] schedule+0x25/0xa0
[  908.026163]  [<ffffffff8150ad95>] rt_spin_lock_slowlock+0xd5/0x2c0
[  908.026170]  [<ffffffff810658cf>] get_signal_to_deliver+0xaf/0x680
[  908.026175]  [<ffffffff8100242d>] do_signal+0x3d/0x5b0
[  908.026179]  [<ffffffff81002a30>] do_notify_resume+0x90/0xe0
[  908.026186]  [<ffffffff81513176>] int_signal+0x12/0x17
[  908.026193]  [<00007ff2a388b1d0>] 0x7ff2a388b1cf

and since upstream does not mind where we do this, be a bit nicer ...

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

156654f4

sched/fair: Fix endless loop in idle_balance() · 35805ff8

Authored by Kirill Tkhai on Mar 06, 2014

Check the number of fair tasks to decide whether we've pulled a task.
rq's nr_running may contain throttled RT tasks.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394118975.19290.104.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>

35805ff8

sched/core: Fix endless loop in pick_next_task() · 4c6c4e38

Authored by Kirill Tkhai on Mar 06, 2014

1) Single cpu machine case.

When rq has only RT tasks, but none of them can be picked
because of throttling, we enter an endless loop.

pick_next_task_{dl,rt} return NULL.

In pick_next_task_fair() we permanently go to retry

	if (rq->nr_running != rq->cfs.h_nr_running)
		return RETRY_TASK;

(rq->nr_running is not being decremented when rt_rq becomes
throttled).

There is no chance to unthrottle any rt_rq or to wake a fair task here,
because rq is locked permanently and interrupts are
disabled.

2) In case of SMP this can cause a hang too. Although we unlock
   rq in idle_balance(), interrupts are still disabled.

The solution is to check for available tasks in DL and RT
classes instead of checking for sum.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>

4c6c4e38
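
The endless retry described in commit 4c6c4e38 above can be modelled in a few lines. This is a simplified sketch: the struct, the `*_pickable` counters and the bounded loop are hypothetical, and it only covers the case where the DL and RT classes have nothing pickable and the runqueue state cannot change between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct rq_sketch {
	unsigned int nr_running;	/* all classes, throttled RT included */
	unsigned int cfs_h_nr_running;	/* fair tasks */
	unsigned int rt_pickable;	/* hypothetical: unthrottled RT tasks */
	unsigned int dl_pickable;	/* hypothetical: runnable DL tasks */
};

typedef bool (*retry_fn)(const struct rq_sketch *rq);

/* Old test: any mismatch in the sums asks for another pass. */
static bool fair_retries_old(const struct rq_sketch *rq)
{
	return rq->nr_running != rq->cfs_h_nr_running;
}

/* New test: retry only if DL or RT really has something to pick. */
static bool fair_retries_new(const struct rq_sketch *rq)
{
	return rq->rt_pickable != 0 || rq->dl_pickable != 0;
}

/* Counts pick attempts, giving up after `limit` to keep the demo finite.
 * The rq never changes here, mirroring "rq is locked and interrupts are
 * disabled". */
static unsigned int pick_attempts(const struct rq_sketch *rq,
				  retry_fn retries, unsigned int limit)
{
	unsigned int attempts = 0;

	assert(rq != NULL && limit > 0);
	while (attempts < limit) {
		attempts++;
		if (!retries(rq))
			break;	/* stop retrying and go idle */
	}
	return attempts;
}
```

With one throttled RT task and no fair tasks, the old test spins until the artificial limit, while the new test stops after the first attempt.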

sched/fair: Push down check for high priority class task into idle_balance() · e4aa358b

Authored by Kirill Tkhai on Mar 06, 2014

We close the idle_exit_fair() bracket in case we've pulled something or we've received
a task of a high priority class.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1394098315.19290.10.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>

e4aa358b

sched/rt: Fix picking RT and DL tasks from empty queue · 734ff2a7

Authored by Kirill Tkhai on Mar 04, 2014

The problems:

1) We check for rt_nr_running before the call to put_prev_task().
   If the previous task is RT, its rt_rq may become throttled
   and dequeued after this call.

If p is from rt->rq, this just causes picking a task
from a throttled queue, but if its rt_rq is a child,
we are guaranteed to hit a BUG_ON.

2) The same applies to the deadline class. The only difference is that
   we operate only on dl_rq.

This patch fixes all the above problems and it adds a small skip in the
DL update like we've already done for RT class:

	if (unlikely((s64)delta_exec <= 0))
		return;

This will optimize sequential update_curr_dl() calls a little.

Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>

734ff2a7

sched/idle: Add more comments to the code · a1d028bd

Authored by Daniel Lezcano on Mar 03, 2014

The idle main function is a complex and critical function. Added more
comments to the code.

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-5-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

sched/idle: Move idle conditions in cpuidle_idle main function · 8ca3c642

Committed by Daniel Lezcano on Mar 03, 2014

This patch moves the conditions checked before entering idle into the cpuidle
main function located in idle.c. That simplifies the idle mainloop functions and
improves the readability of the conditions for truly entering idle.

This patch is code reorganization and does not change the behavior of the
function.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-4-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

sched/idle: Reorganize the idle loop · c8cc7d4d

Committed by Daniel Lezcano on Mar 03, 2014

Now that we have the main cpuidle function in idle.c, move some code from
the idle mainloop to this function for the sake of clarity.

That removes the if/then/else indentation, which is difficult to follow when
reading the code. This patch does not change the current behavior.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

cpuidle/idle: Move the cpuidle_idle_call function to idle.c · 30cdd69e

Committed by Daniel Lezcano on Mar 03, 2014

cpuidle_idle_call() does nothing more than call three individual
functions, and it is no longer used by any arch-specific code, only by the
cpuidle framework code.

We can move this function into the idle task code to ensure better
proximity to the scheduler code.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

sched/clock: Prevent tracing recursion in sched_clock_cpu() · 96b3d28b

Committed by Fernando Luis Vazquez Cao on Mar 06, 2014

Prevent tracing of preempt_disable/enable() in sched_clock_cpu().
When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are
traced and this causes trace_clock() users (and probably others) to
go into an infinite recursion. Systems with a stable sched_clock()
are not affected.

This problem is similar to that fixed by upstream commit 95ef1e52
("KVM guest: prevent tracing recursion with kvmclock").
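
The recursion described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: every function below is a hypothetical stand-in written for illustration (none of it is kernel code), the tracer is reduced to a single hook that reads a clock for its timestamp, and `DEPTH_CAP` exists solely to stop the simulated recursion before it overflows the stack.

```c
#include <stdint.h>

#define DEPTH_CAP 16	/* artificial limit; the modelled recursion has none */

static int depth, max_depth;		/* current and deepest hook nesting */
static uint64_t fake_time;		/* toy monotonic counter */
static uint64_t (*trace_clock_fn)(void);	/* clock the tracer reads */

/* Stand-in tracer hook: timestamps an event by reading the clock. */
static void trace_hook(void)
{
	if (++depth > max_depth)
		max_depth = depth;
	if (depth < DEPTH_CAP)
		trace_clock_fn();
	depth--;
}

/* Traced variant calls back into the tracer; notrace variant does not. */
static void model_preempt_disable_traced(void) { trace_hook(); }
static void model_preempt_disable_notrace(void) { }

/* Clock that uses the traced primitive: re-enters the tracer. */
uint64_t clock_with_traced_preempt(void)
{
	model_preempt_disable_traced();
	return ++fake_time;
}

/* Clock that uses the untraced primitive: no re-entry. */
uint64_t clock_with_notrace_preempt(void)
{
	model_preempt_disable_notrace();
	return ++fake_time;
}

/* Reads the given clock once and reports how deep the hook nested. */
int max_nesting(uint64_t (*clock)(void))
{
	depth = max_depth = 0;
	trace_clock_fn = clock;
	clock();
	return max_depth;
}
```

In this model, `max_nesting(clock_with_traced_preempt)` runs into `DEPTH_CAP`, while `max_nesting(clock_with_notrace_preempt)` never enters the hook at all.
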
Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexus
Signed-off-by: Ingo Molnar <mingo@kernel.org>

sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy · d44753b8

Committed by Juri Lelli on Mar 03, 2014

Deny the use of the SCHED_DEADLINE policy to unprivileged users.
Even though root users can set the policy for normal users, we
don't want the latter to be able to change their parameters
(the safest behavior).
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
