Commits · ec12cb7f31e28854efae7dd6f9544e0a66379040 · openanolis / cloud-kernel

14 Aug, 2011 16 commits

sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth · ec12cb7f

Authored by Paul Turner on Jul 21, 2011

Account bandwidth usage on the cfs_rq level versus the task_groups to which
they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
under cfs_rq->runtime_enabled.

cfs_rq's which belong to a bandwidth constrained task_group have their runtime
accounted via the update_curr() path, which withdraws bandwidth from the global
pool as desired. Updates involving the global pool are currently protected
under cfs_bandwidth->lock, local runtime is protected by rq->lock.

This patch only assigns and tracks quota; no action is taken in the case that
cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.
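
The accounting described above can be pictured with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: the class names, the `account()` method and the 5000 usec slice size are hypothetical, and the locking (cfs_bandwidth->lock for the pool, rq->lock for local runtime) is only noted in comments.

```python
from dataclasses import dataclass


@dataclass
class GlobalPool:
    """Per-task_group pool (protected by cfs_bandwidth->lock in the kernel)."""
    runtime: float  # unclaimed runtime left, in usecs

    def withdraw(self, want: float) -> float:
        # Hand out at most what is left in the pool.
        got = min(want, self.runtime)
        self.runtime -= got
        return got


@dataclass
class CfsRq:
    """Per-cpu local state (protected by rq->lock in the kernel)."""
    pool: GlobalPool
    runtime_enabled: bool = True
    runtime_assigned: float = 0.0
    runtime_used: float = 0.0

    def account(self, delta_exec: float, slice_us: float = 5000.0) -> None:
        # Hypothetical stand-in for the update_curr() accounting hook.
        if not self.runtime_enabled:
            return
        self.runtime_used += delta_exec
        # Withdraw slices from the global pool while usage is ahead of
        # what was assigned; stop once the pool is empty.
        while self.runtime_used > self.runtime_assigned:
            got = self.pool.withdraw(slice_us)
            if got == 0:
                break  # tracking only: the overrun is recorded, nothing is throttled
            self.runtime_assigned += got
```

In this model a cfs_rq that runs past the pool simply ends up with runtime_used greater than runtime_assigned, which mirrors the "no action is taken" statement above.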

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Validate CFS quota hierarchies · a790de99

Authored by Paul Turner on Jul 21, 2011

Add constraints validation for CFS bandwidth hierarchies.

Validate that:
   max(child_bandwidth) <= parent_bandwidth

In a quota limited hierarchy, an unconstrained entity
(e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent.

This constraint is chosen over sum(child_bandwidth) as the notion of over-commit
is valuable within SCHED_OTHER.  Some basic code from the RT case is re-factored
for reuse.
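
A minimal sketch of the constraint, assuming bandwidth is expressed as a quota/period ratio and the hierarchy is a plain dict. The `validate()` helper and the tree layout are hypothetical and only illustrate the rule, including inheritance for unconstrained groups:

```python
RUNTIME_INF = float("inf")  # stand-in for an unconstrained group


def validate(tree, name, parent_bw=RUNTIME_INF):
    """Check max(child_bandwidth) <= parent_bandwidth for every group.

    tree maps a group name to (bandwidth, [child names]).
    """
    own, children = tree[name]
    if own != RUNTIME_INF and own > parent_bw:
        return False
    # An unconstrained group inherits the bandwidth of its parent.
    effective = parent_bw if own == RUNTIME_INF else own
    return all(validate(tree, child, effective) for child in children)


over_committed = {
    "root": (RUNTIME_INF, ["a"]),
    "a": (0.5, ["a1", "a2"]),
    "a1": (0.5, []),
    "a2": (0.4, []),  # sum of children is 0.9 > 0.5, which is allowed
}
too_big_grandchild = {
    "root": (RUNTIME_INF, ["b"]),
    "b": (0.5, ["b1"]),
    "b1": (RUNTIME_INF, ["b2"]),  # inherits 0.5 from b
    "b2": (0.6, []),              # exceeds the inherited limit
}
```

The first tree passes because only each individual child is compared against its parent; the second fails because the unconstrained middle group still carries its parent's limit.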

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Introduce primitives to account for CFS bandwidth tracking · ab84d31e

Authored by Paul Turner on Jul 21, 2011

In this patch we introduce the notion of CFS bandwidth, partitioned into
globally unassigned bandwidth, and locally claimed bandwidth.

 - The global bandwidth is per task_group, it represents a pool of unclaimed
   bandwidth that cfs_rqs can allocate from.
 - The local bandwidth is tracked per-cfs_rq, this represents allotments from
   the global pool bandwidth assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
 - cpu.cfs_period_us : the bandwidth period in usecs
 - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
   to consume over the period above.
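
As a worked example of how the two values relate, the sketch below converts a quota/period pair into the number of CPUs' worth of time a group may consume per period. The helper name and the sample values are illustrative assumptions, not defaults taken from the patch:

```python
def cpus_allowed(cfs_quota_us: int, cfs_period_us: int) -> float:
    """CPU time the group may consume per period, in units of whole CPUs."""
    if cfs_period_us <= 0:
        raise ValueError("period must be positive")
    return cfs_quota_us / cfs_period_us


# Illustrative values only: a 100ms period with 50ms and 200ms quotas.
half_cpu = cpus_allowed(50_000, 100_000)
two_cpus = cpus_allowed(200_000, 100_000)
```

A quota larger than the period is meaningful on SMP: the group's cfs_rqs on several CPUs draw from the same global pool.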

Signed-off-by: Paul Turner <pjt@google.com>
Signed-off-by: Nikhil Rao <ncrao@google.com>
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Implement hierarchical task accounting for SCHED_OTHER · 953bfcd1

Authored by Paul Turner on Jul 21, 2011

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations. This in turn leads to incorrect idle and
weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner <pjt@google.com>
Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched/cpupri: Remove cpupri->pri_active · 5710f15b

Authored by Yong Zhang on Aug 06, 2011

Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, so just remove it. Also
clean up the code related to it.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched/cpupri: Fix memory barriers for vec updates to always be in order · d473750b

Authored by Steven Rostedt on Aug 05, 2011

[ This patch actually compiles. Thanks to Mike Galbraith for pointing
that out. I compiled and booted this patch with no issues. ]

Re-examining the cpupri patch, I see there's a possible race because the
updates of the two priorities' vec->counts are not protected by a memory
barrier.

When a RT runqueue is overloaded and wants to push an RT task to another
runqueue, it scans the RT priority vectors in a loop from lowest
priority to highest.

When we queue or dequeue an RT task that changes a runqueue's highest
priority task, we update the vectors to show that a runqueue is rated at
a different priority. To do this, we first set the new priority mask
and increment its vec->count, and then clear the old priority mask and
decrement its vec->count.

If we are lowering the runqueue's RT priority rating, it will trigger a
RT pull, and we do not care if we miss pushing to this runqueue or not.

But if we raise the priority, but the priority is still lower than an RT
task that is looking to be pushed, we must make sure that this runqueue
is still seen by the push algorithm (the loop).

Because the loop reads from lowest to highest, and the new priority is
set before the old one is cleared, we will either see the new or old
priority set and the vector will be checked.

But! Since there's no memory barrier between the updates of the two, the
old count may be decremented first before the new count is incremented.
This means the loop may see the old count of zero and skip it, and also
the new count of zero before it was updated. A possible runqueue that
the RT task could move to could be missed.

A conditional memory barrier is placed between the vec->count updates
and is only called when both updates are done.

The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(),
as they are not needed by archs that already synchronize
atomic_inc/dec().

The smp_rmb() has been moved to be called at every iteration of the loop
so that the race between seeing the two updates is visible by each
iteration of the loop, as an arch is free to optimize the reading of
memory of the counters in the loop.
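
The race can be illustrated with a toy model that enumerates every interleaving of the two counter updates with a two-step scan. This is a sketch under assumptions: the priority numbers are arbitrary, the "reordered" writer stands in for what missing barriers would allow, and none of the names below exist in the kernel.

```python
from itertools import combinations

OLD, NEW = 1, 2  # arbitrary levels; the rq is being raised from OLD to NEW


def interleavings(writer, reader):
    """Yield every merge of the two sequences that keeps each one's order."""
    n = len(writer) + len(reader)
    for slots in combinations(range(n), len(writer)):
        w, r = iter(writer), iter(reader)
        yield [next(w) if i in slots else next(r) for i in range(n)]


def scan_can_miss(writer_order):
    """True if some interleaving lets the scan see zero at both levels."""
    reader = [("read", OLD), ("read", NEW)]  # scan goes lowest -> highest
    for schedule in interleavings(writer_order, reader):
        count = {OLD: 1, NEW: 0}
        seen = []
        for op, prio in schedule:
            if op == "inc":
                count[prio] += 1
            elif op == "dec":
                count[prio] -= 1
            else:
                seen.append(count[prio])
        if not any(seen):
            return True
    return False


in_order = [("inc", NEW), ("dec", OLD)]   # what the barrier guarantees
reordered = [("dec", OLD), ("inc", NEW)]  # what may be observed without it
```

With the increment made visible first, any scan that finds the old count at zero is guaranteed to find the new count non-zero afterwards; with the updates reordered, one interleaving lets the scan skip both vectors.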

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched/cpupri: Remove the vec->lock · c92211d9

Authored by Steven Rostedt on Aug 02, 2011

The cpupri vec->lock has been showing up as a top contention point
lately. This is because the RT push/pull logic takes an
aggressive approach for migrating RT tasks. The cpupri logic is
in place to improve the performance of the push/pull when dealing
with machines that have a large number of CPUs.

The problem though is a vec->lock is required, where a vec is a
global per RT priority structure. That is, if there are lots of
RT tasks at the same priority, every time they are added or removed
from the RT queue, this global vec->lock is taken. Now that more
kernel threads are becoming RT (RCU boost and threaded interrupts)
this is becoming much more of an issue.

There are two variables that are being synced by the vec->lock:
the cpupri bitmask and the vec->count. The cpupri bitmask
is one bit per priority. If an RT priority vec has a process queued,
then the vec->count is > 0 and the cpupri bitmask is set for that
RT priority.

If the cpupri bitmask gets out of sync with the vec->count, we could
end up pushing a low priority RT task to a high priority queue.
That RT task that could have run immediately could be queued on a
run queue with a higher priority task indefinitely.

The solution is not to use the cpupri bitmask and just look at the
vec->count directly when doing a pull. The cpupri bitmask is just
a fast way to scan the RT priorities when a pull is made. If, instead
of using the bitmask, we just examine all RT priorities and
look at the vec->counts, we can eliminate the vec->lock. The
scan of RT tasks is to find a run queue that we can push an RT task
to, and we do not push to a high priority queue, thus the scan only
needs to go from 1 to RT task->prio, and not all 100 RT priorities.
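
A rough sketch of the bitmask-free scan, assuming one count and one CPU set per priority level. The function name, the list/set representation and the index convention (lower index means lower priority) are assumptions for illustration; the real code works on atomic counters and cpumasks.

```python
def find_lowest(vec_counts, vec_masks, task_pri, allowed_cpus):
    """Scan levels below task_pri, lowest first, reading the counts directly.

    Returns (level, candidate CPUs) for the first level that has a run
    queue the task is allowed to run on, or None if there is none.
    """
    for pri in range(task_pri):
        if vec_counts[pri] == 0:
            continue  # nothing rated at this level, skip without any lock
        candidates = vec_masks[pri] & allowed_cpus
        if candidates:
            return pri, candidates
    return None


# Hypothetical state: CPUs 0 and 3 rated at level 1, CPU 1 at level 3.
counts = [0, 2, 0, 1]
masks = [set(), {0, 3}, set(), {1}]
```

The loop stops at the task's own level because pushing to a run queue of equal or higher priority would not let the task run.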

The push algorithm, which does the scan of RT priorities (and
scan of the bitmask) only happens when we have an overloaded RT run
queue (more than one RT task queued). The grabbing of the vec->lock
happens every time any RT task is queued or dequeued on the run
queue for that priority. The slowdown of the scan from not using
a bitmask is negligible compared to the speedup from removing the vec->lock
contention and replacing it with an atomic counter and memory barrier.

To prove this, I wrote a patch that times both the loop and the code
that grabs the vec->locks. I passed the patches to various people
(and companies) to test and show the results. I let everyone choose
their own load to test, giving different loads on the system,
for various different setups.

Here are some of the results (snipped to a few CPUs to keep
this change log from getting huge, but the results were consistent
across the entire system):

System 1 (24 CPUs)

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
[...]
cpu 20: loop    3057    1.766   0.061   0.642   1963.170
        vec     6782949 90.469  0.089   0.414   2811760.503
cpu 21: loop    2617    1.723   0.062   0.641   1679.074
        vec     6782810 90.499  0.089   0.291   1978499.900
cpu 22: loop    2212    1.863   0.063   0.699   1547.160
        vec     6767244 85.685  0.089   0.435   2949676.898
cpu 23: loop    2320    2.013   0.062   0.594   1380.265
        vec     6781694 87.923  0.088   0.431   2928538.224

After patch:
cpu 20: loop    2078    1.579   0.061   0.533   1108.006
        vec     6164555 5.704   0.060   0.143   885185.809
cpu 21: loop    2268    1.712   0.065   0.575   1305.248
        vec     6153376 5.558   0.060   0.187   1154960.469
cpu 22: loop    1542    1.639   0.095   0.533   823.249
        vec     6156510 5.720   0.060   0.190   1172727.232
cpu 23: loop    1650    1.733   0.068   0.545   900.781
        vec     6170784 5.533   0.060   0.167   1034287.953

All times are in microseconds. The 'loop' is the amount of time spent
doing the loop across the priorities (before patch uses bitmask).
the 'vec' is the amount of time in the code that requires grabbing
the vec->lock. The second patch just does not have the vec lock, but
encompasses the same code.

Amazingly the loop code even went down on average. The vec code went
from .5 down to .18, that's more than half the time spent!

Note, more than one test was run, but they all had the same results.

System 2 (64 CPUs)

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 60: loop    0       0       0       0       0
        vec     5410840 277.954 0.084   0.782   4232895.727
cpu 61: loop    0       0       0       0       0
        vec     4915648 188.399 0.084   0.570   2803220.301
cpu 62: loop    0       0       0       0       0
        vec     5356076 276.417 0.085   0.786   4214544.548
cpu 63: loop    0       0       0       0       0
        vec     4891837 170.531 0.085   0.799   3910948.833

After patch:
cpu 60: loop    0       0       0       0       0
        vec     5365118 5.080   0.021   0.063   340490.267
cpu 61: loop    0       0       0       0       0
        vec     4898590 1.757   0.019   0.071   347903.615
cpu 62: loop    0       0       0       0       0
        vec     5737130 3.067   0.021   0.119   687108.734
cpu 63: loop    0       0       0       0       0
        vec     4903228 1.822   0.021   0.071   348506.477

The test run during the measurement did not have any (very few,
from other CPUs) RT tasks pushing. But this shows that it helped
out tremendously with the contention, as the contention happens
because the vec->lock is taken only on queuing at an RT priority,
and different CPUs that queue tasks at the same priority will
have contention.

I tested on my own 4 CPU machine with the following results:

Before patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 0:  loop    2377    1.489   0.158   0.588   1398.395
        vec     4484    770.146 2.301   4.396   19711.755
cpu 1:  loop    2169    1.962   0.160   0.576   1250.110
        vec     4425    152.769 2.297   4.030   17834.228
cpu 2:  loop    2324    1.749   0.155   0.559   1299.799
        vec     4368    779.632 2.325   4.665   20379.268
cpu 3:  loop    2325    1.629   0.157   0.561   1306.113
        vec     4650    408.782 2.394   4.348   20222.577

After patch:
CPU:    Name    Count   Max     Min     Average Total
----    ----    -----   ---     ---     ------- -----
cpu 0:  loop    2121    1.616   0.113   0.636   1349.189
        vec     4303    1.151   0.225   0.421   1811.966
cpu 1:  loop    2130    1.638   0.178   0.644   1372.927
        vec     4627    1.379   0.235   0.428   1983.648
cpu 2:  loop    2056    1.464   0.165   0.637   1310.141
        vec     4471    1.311   0.217   0.433   1937.927
cpu 3:  loop    2154    1.481   0.162   0.601   1295.083
        vec     4236    1.253   0.230   0.425   1803.008

This was running my migrate.c code that can be found at:
http://lwn.net/Articles/425763/

The migrate code does stress the RT tasks a bit. This shows that
the loop did increase a little after the patch, but not by much.
The vec code dropped dramatically. From 4.3us down to .42us.
That's a 10x improvement!
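
The 10x figure can be checked against the four per-CPU 'vec' averages in the 4 CPU tables above. This is only a quick arithmetic sketch over the numbers quoted in this change log:

```python
from statistics import mean

# 'vec' Average column (microseconds) for cpu 0..3, copied from the tables.
before_us = [4.396, 4.030, 4.665, 4.348]
after_us = [0.421, 0.428, 0.433, 0.425]

speedup = mean(before_us) / mean(after_us)
print(f"{mean(before_us):.2f}us -> {mean(after_us):.2f}us, {speedup:.1f}x")
```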

Tested-by: Mike Galbraith <mgalbraith@suse.de>
Tested-by: Luis Claudio R. Gonçalves <lgoncalv@redhat.com>
Tested-by: Matthew Hank Sabins <msabins@linux.vnet.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Reviewed-by: Gregory Haskins <gregory.haskins@gmail.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Use pushable_tasks to determine next highest prio · 5181f4a4

Authored by Steven Rostedt on Jun 16, 2011

Hillf Danton proposed a patch (see link) that cleaned up the
sched_rt code that calculates the priority of the next highest priority
task to be used in finding run queues to pull from.

His patch removed the calculation of the next prio and just used the current
prio when determining if we should examine a run queue to pull from. The problem
with his patch was that it caused more false checks, because we check a run
queue for pushable tasks if the current priority of that run queue is higher
in priority than the task about to run on our run queue. But after grabbing
the locks and doing the real check, we may find that there is no task
with a higher prio to pull. Thus the locks were taken with nothing to
do.

I added some trace_printks() to record when and how many times the run queue
locks were taken to check for pullable tasks, compared to how many times we
pulled a task.

With the current method, it was:

3806 locks taken vs 2812 pulled tasks

With Hillf's patch:

6728 locks taken vs 2804 pulled tasks

The number of times locks were taken to pull a task went up almost double with
no more success rate.
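
Put as success rates, using only the counts quoted above (a quick arithmetic sketch; the helper name is made up):

```python
def pull_success_rate(locks_taken: int, pulled: int) -> float:
    """Fraction of lock acquisitions that ended in an actual pull."""
    return pulled / locks_taken


current = pull_success_rate(3806, 2812)  # current method
hillf = pull_success_rate(6728, 2804)    # with Hillf's patch
lock_growth = 6728 / 3806                # how many more locks were taken
print(f"{current:.0%} vs {hillf:.0%}, locks x{lock_growth:.2f}")
```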

But his patch did get me thinking. When we look at the priority of the highest
task to consider taking the locks to do a pull, a failure to pull can be one
of the following: (in order of most likely)

o RT task was pushed off already between the check and taking the lock
o Waiting RT task can not be migrated
o RT task's CPU affinity does not include the target run queue's CPU
o RT task's priority changed between the check and taking the lock

And with Hillf's patch, the thing that caused most of the failures, is
the RT task to pull was not at the right priority to pull (not greater than
the current RT task priority on the target run queue).

Most of the above cases we can't help. But the current method does not check
if the next highest prio RT task can be migrated or not, and if it can not,
we still grab the locks to do the test (we don't find out about this fact until
after we have the locks). I thought about this case, and realized that the
pushable task plist that is maintained only holds RT tasks that can migrate.
If we move the calculating of the next highest prio task from the inc/dec_rt_task()
functions into the queuing of the pushable tasks, then we only measure the
priorities of those tasks that we push, and we get this basically for free.
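
The idea can be sketched as follows: the "next highest prio" value is refreshed only when a task enters or leaves the pushable list, so tasks that cannot migrate never influence it. This is a simplified model under assumptions: a sorted Python list stands in for the plist, lower numbers mean higher priority, and the class and method names are hypothetical.

```python
import bisect

MAX_RT_PRIO = 100  # sentinel meaning "no pushable task queued"


class RtRq:
    def __init__(self):
        self.pushable = []            # prios of RT tasks that can migrate
        self.next_prio = MAX_RT_PRIO  # best prio among pushable tasks

    def enqueue_pushable(self, prio: int) -> None:
        bisect.insort(self.pushable, prio)
        # A better pushable task becomes the new "next" candidate.
        if prio < self.next_prio:
            self.next_prio = prio

    def dequeue_pushable(self, prio: int) -> None:
        self.pushable.remove(prio)
        # Fall back to the best remaining pushable task, if any.
        self.next_prio = self.pushable[0] if self.pushable else MAX_RT_PRIO
```

A remote CPU deciding whether to take the locks would compare against next_prio, which by construction only reflects tasks that are allowed to move.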

Not only does this patch make the code a little more efficient, it cleans it
up and makes it a little simpler.

Thanks to Hillf Danton for inspiring me on this patch.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Gregory Haskins <ghaskins@novell.com>
Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Balance RT tasks when forked as well · c37495fd

Authored by Steven Rostedt on Jun 16, 2011

When a new task is woken, the code to balance the RT task is currently
skipped in the select_task_rq() call. But it will be pushed if the rq
is currently overloaded with RT tasks anyway. The issue is that we
already queued the task, and if it does get pushed, it will have to
be dequeued and requeued on the new run queue. The advantage with
pushing it first is that we avoid this requeuing as we are pushing it
off before the task is ever queued.

See commit 318e0893 ("sched: pre-route RT tasks on wakeup")
for more details.

The return of select_task_rq() when it is not a wake up has also been
changed to return task_cpu() instead of smp_processor_id(). This is more
of a sanity check, because the only other current user of select_task_rq()
besides wake ups is an exec, where task_cpu() should also be the same
as smp_processor_id(). But if it is used for other purposes, let's keep
the task on the same CPU. Why would we want to migrate it to the current
CPU?

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hillf Danton <dhillf@gmail.com>
Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Remove resetting exec_start in put_prev_task_rt() · 1812a643

Authored by Hillf Danton on Jun 16, 2011

There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
when the task gets back to the run queue. This saves us doing a store() in the
fast path.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task() · 311e800e

Authored by Hillf Danton on Jun 16, 2011

Do not call dequeue_pushable_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mike Galbraith <mgalbraith@gmx.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yong Zhang <yong.zhang0@gmail.com>
Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Remove noop in lowest_flag_domain() · 08354716

Authored by Hillf Danton on Jun 16, 2011

Checking for the validity of sd is removed, since it is already
checked by the for_each_domain macro.

Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Remove noop in next_prio() · 67d95538

Authored by Hillf Danton on Jun 16, 2011

When computing the next priority for a given run-queue, the check for
RT priority of the task determined by the pick_next_highest_task_rt()
function could be removed, since only RT tasks are returned by the
function.

Reviewed-by: Yong Zhang <yong.zhang0@gmail.com>
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: fix broken SCHED_RESET_ON_FORK handling · c350a04e

Authored by Mike Galbraith on Jul 27, 2011

Setting child->prio = current->normal_prio _after_ SCHED_RESET_ON_FORK has
been handled for an RT parent gives birth to a deranged mutant child with
non-RT policy, but RT prio and sched_class.

Move PI leakage protection up, always set priorities and weight, and if the
child is leaving RT class, reset rt_priority to the proper value.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311779695.8691.2.camel@marge.simson.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Kill WAKEUP_PREEMPT · 2c2efaed

Authored by Yong Zhang on Jul 29, 2011

Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense
and it has outlived its use by a long long while.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Remove rq->avg_load_per_task · e2b245f8

Authored by Jan H. Schönherr on Aug 01, 2011

Since commit a2d47777 ("sched: fix stale value in average load per task")
the variable rq->avg_load_per_task is no longer required. Remove it.

Signed-off-by: Jan H. Schönherr <schnhrr@cs.tu-berlin.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1312189408-17172-1-git-send-email-schnhrr@cs.tu-berlin.de
Signed-off-by: Ingo Molnar <mingo@elte.hu>

12 Aug, 2011 1 commit

move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997

Authored by Vasiliy Kulikov on Aug 08, 2011

The patch http://lkml.org/lkml/2003/7/13/226 introduced an RLIMIT_NPROC
check in set_user() to check for NPROC exceeding via setuid() and
similar functions.

Before the check there was a possibility to greatly exceed the allowed
number of processes by an unprivileged user if the program relied on
rlimit only.  But the check created a new security threat: many poorly
written programs simply don't check the setuid() return code and believe it
cannot fail if executed with root privileges.  So, the check is removed
in this patch because privilege escalations related to buggy programs
are too frequent.

The NPROC can still be enforced in the common code flow of daemons
spawning user processes.  Most daemons do fork()+setuid()+execve().
The check introduced in execve() (1) enforces the same limit as in
setuid() and (2) doesn't create similar security issues.

Neil Brown suggested tracking which specific process has exceeded the
limit by setting the PF_NPROC_EXCEEDED process flag.  With the change only
this process would fail on execve(), and other processes' execve()
behaviour is not changed.

Solar Designer suggested re-checking whether the NPROC limit is still
exceeded at the moment of execve().  If the process was sleeping for
days between set*uid() and execve(), and the NPROC counter stepped down
under the limit, a deferred execve() failure because the NPROC limit was
exceeded days ago would be unexpected.  If the limit is not exceeded
anymore, we clear the flag on successful calls to execve() and fork().

The flag is also cleared on successful calls to set_user() as the limit
was exceeded for the previous user, not the current one.
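
The flag handling described above can be modelled roughly as below. This is a user-space sketch under assumptions, not the kernel implementation: the Task class, the per-uid dictionary, the string return values and the exact comparison against the limit are all simplifications made for illustration.

```python
class Task:
    def __init__(self, uid: int):
        self.uid = uid
        self.nproc_exceeded = False  # stands in for PF_NPROC_EXCEEDED


def set_user(task: Task, new_uid: int, procs: dict, limit: int) -> None:
    """Switch uid; never fails on the limit, only records the overrun."""
    procs[task.uid] -= 1
    procs[new_uid] = procs.get(new_uid, 0) + 1
    task.uid = new_uid
    # Set for the new user, or cleared if the new user is within the limit.
    task.nproc_exceeded = procs[new_uid] > limit


def execve(task: Task, procs: dict, limit: int) -> str:
    """Fail only if the flag is set and the limit is still exceeded now."""
    if task.nproc_exceeded and procs[task.uid] > limit:
        return "EAGAIN"
    task.nproc_exceeded = False
    return "OK"
```

Because set_user() cannot fail for this reason in the model, a program that ignores the setuid() return code is never left running with unexpected privileges; the limit is enforced at the later execve() instead.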

A similar check was introduced in -ow patches (without the process flag).

v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().

Reviewed-by: James Morris <jmorris@namei.org>
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

10 Aug, 2011 1 commit

cap_syslog: don't use WARN_ONCE for CAP_SYS_ADMIN deprecation warning · f2c0d026

Authored by Jonathan Nieder on Aug 08, 2011

syslog-ng versions before 3.3.0beta1 (2011-05-12) assume that
CAP_SYS_ADMIN is sufficient to access syslog, so ever since CAP_SYSLOG
was introduced (2010-11-25) they have triggered a warning.

Commit ee24aebf ("cap_syslog: accept CAP_SYS_ADMIN for now")
improved matters a little by making syslog-ng work again, just keeping
the WARN_ONCE().  But still, this is a warning that writes a stack trace
we don't care about to syslog, sets a taint flag, and alarms sysadmins
when nothing worse has happened than use of an old userspace with a
recent kernel.

Convert the WARN_ONCE to a printk_once to avoid that while continuing to
give userspace developers a hint that this is an unwanted
backward-compatibility feature and won't be around forever.

Reported-by: Ralf Hildebrandt <ralf.hildebrandt@charite.de>
Reported-by: Niels <zorglub_olsen@hotmail.com>
Reported-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Liked-by: Gergely Nagy <algernon@madhouse-project.org>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

06 Aug, 2011 1 commit

jump label: Reduce the cycle count by changing the link order · b77f0f3c

Authored by Jason Baron on Aug 05, 2011

In the course of testing jump labels for use with the CFS
bandwidth controller, Paul Turner discovered that using jump
labels reduced the branch count and the instruction count, but
did not reduce the cycle count or wall time.

I noticed that having the jump_label.o included in the kernel
but not used in any way still caused this increase in cycle
count and wall time. Thus, I moved jump_label.o in the
kernel/Makefile, thus changing the link order, and presumably
moving it out of hot icache areas. This brought down the cycle
count/time as expected.

In addition to Paul's testing,  I've tested the patch using a
single 'static_branch()' in the getppid() path, and basically
running tight loops of calls to getppid(). Here are my results
for the branch disabled case:

With jump labels turned on (CONFIG_JUMP_LABEL), branch disabled:

 Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs):

     3,969,510,217 instructions             #	   0.864 IPC     ( +-0.000% )
     4,592,334,954 cycles                     ( +-   0.046% )
       751,634,470 branches                   ( +-   0.000% )

        1.722635797  seconds time elapsed   ( +-   0.046% )

Jump labels turned off (CONFIG_JUMP_LABEL not set), branch
disabled:

 Performance counter stats for 'bash -c /tmp/getppid;true' (50 runs):

     4,009,611,846 instructions             #	   0.867 IPC     ( +-0.000% )
     4,622,210,580 cycles                     ( +-   0.012% )
       771,662,904 branches                   ( +-   0.000% )

        1.734341454  seconds time elapsed   ( +-   0.022% )

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: rth@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20110805204040.GG2522@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Paul Turner <pjt@google.com>

04 Aug, 2011 6 commits

lockdep: Clear whole lockdep_map on initialization · f59de899

Authored by Tejun Heo on Jul 14, 2011

lockdep_init_map() only initializes parts of lockdep_map and triggers
kmemcheck warning when it is copied as a whole. There isn't anything
to be gained by clearing selectively. memset() the whole structure
and remove loop for ->class_cache[] clearing.

Addresses https://bugzilla.kernel.org/show_bug.cgi?id=35532

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Christian Casteyde <casteyde.christian@free.fr>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=35532
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110714131909.GJ3455@htj.dyndns.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

lockdep: Fix up warning · 70a0686a

Authored by Peter Zijlstra on Jul 25, 2011

On Sun, 2011-07-24 at 21:06 -0400, Arnaud Lacombe wrote:

> /src/linux/linux/kernel/lockdep.c: In function 'mark_held_locks':
> /src/linux/linux/kernel/lockdep.c:2471:31: warning: comparison of
> distinct pointer types lacks a cast

The warning is harmless in this case, but the below makes it go away.

Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311588599.2617.56.camel@laptop
Signed-off-by: Ingo Molnar <mingo@elte.hu>

lockdep: Fix trace_hardirqs_on_caller() · 7d36b26b

Authored by Peter Zijlstra on Jul 26, 2011

Commit dd4e5d3a ("lockdep: Fix trace_[soft,hard]irqs_[on,off]()
recursion") made a bit of a mess of the various checks and error
conditions.

In particular it moved the check for !irqs_disabled() before the
spurious enable test, resulting in some warnings.

Reported-by: Arnaud Lacombe <lacombar@gmail.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1311679697.24752.28.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Boot up with usermodehelper disabled · 288d5abe

Authored by Linus Torvalds on Aug 03, 2011

The core device layer sends tons of uevent notifications for each device
it finds, and if the kernel has been built with a non-empty
CONFIG_UEVENT_HELPER_PATH that will make us try to execute the usermode
helper binary for all these events very early in the boot.

Not only won't the root filesystem even be mounted at that point, we
literally won't have necessarily even initialized all the process
handling data structures at that point, which causes no end of silly
problems even when the usermode helper doesn't actually succeed in
executing.

So just use our existing infrastructure to disable the usermodehelpers
to make the kernel start out with them disabled.  We enable them when
we've at least initialized stuff a bit.

Problems related to an uninitialized

	init_ipc_ns.ids[IPC_SHM_IDS].rw_mutex

reported by various people.
Reported-by: NManuel Lauss <manuel.lauss@googlemail.com>
Reported-by: NRichard Weinberger <richard@nod.at>
Reported-by: NMarc Zyngier <maz@misterjones.org>
Acked-by: NKay Sievers <kay.sievers@vrfy.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

288d5abe
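
The "start disabled, enable once initialized" idea in this commit can be modeled in user space roughly as follows. This is only a sketch: helpers_disabled, model_enable_helpers() and model_call_helper() are hypothetical names, and returning -EBUSY while disabled is an assumption of the model rather than something stated in the commit message.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state: helpers start out disabled at boot. */
static bool helpers_disabled = true;

/* Called once enough of the system has been initialized. */
static void model_enable_helpers(void)
{
    helpers_disabled = false;
}

/*
 * Model of a helper invocation. While helpers are disabled the request
 * is refused instead of trying to execute a binary (assumed -EBUSY).
 */
static int model_call_helper(const char *path)
{
    (void)path;              /* the model never executes anything */
    if (helpers_disabled)
        return -EBUSY;
    return 0;                /* a real implementation would spawn here */
}
```

In this model, any uevent raised before model_enable_helpers() runs is simply refused, which is the behaviour the commit message asks for during early boot.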

taskstats: add_del_listener() should ignore !valid listeners · a7295898

Committed by Oleg Nesterov on Aug 03, 2011

When send_cpu_listeners() finds the orphaned listener it marks it as
!valid and drops listeners->sem.  Before it takes this sem for writing,
s->pid can be reused and add_del_listener() can wrongly try to re-use
this entry.

Change add_del_listener() to check that ->valid is true.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

a7295898
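
A minimal user-space sketch of the lookup rule this commit describes: an entry whose pid matches is only treated as the existing registration if it is still valid. The struct layout and the name find_valid_listener() are assumptions for illustration, not the kernel's code, and locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a taskstats listener entry (assumption). */
struct listener {
    int pid;
    bool valid;              /* cleared once the listener is found orphaned */
    struct listener *next;
};

/*
 * Return the live registration for pid, or NULL. Entries marked !valid
 * are skipped because their pid may since have been reused by an
 * unrelated task.
 */
static struct listener *find_valid_listener(struct listener *head, int pid)
{
    for (struct listener *s = head; s != NULL; s = s->next) {
        if (s->pid == pid && s->valid)
            return s;
    }
    return NULL;
}
```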

taskstats: add_del_listener() shouldn't use the wrong node · dfc428b6

Committed by Oleg Nesterov on Aug 03, 2011

1. Commit 26c4caea "don't allow duplicate entries in listener mode"
   changed add_del_listener(REGISTER) so that "next_cpu:" can reuse the
   listener allocated for the previous cpu, this doesn't look exactly
   right even if minor.

   Change the code to kfree() in the already-registered case, this case
   is unlikely anyway so the extra kmalloc_node() shouldn't hurt but
   looks more correct and clean.

2. use the plain list_for_each_entry() instead of _safe() to scan
   listeners->list.

3. Remove the unneeded INIT_LIST_HEAD(&s->list), we are going to
   list_add(&s->list).
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

dfc428b6

02 Aug, 2011 4 commits

kdb,kgdb: Allow arbitrary kgdb magic knock sequences · 37f86b46

Committed by Jason Wessel on May 24, 2011

The first packet that gdb sends when the kernel is in kdb mode seems
to change with every release of gdb.  Instead of continuing to add
many different gdb packets, change kdb to automatically look for
anything that looks like a gdb packet.

Example 1 cold start test:
echo g > /proc/sysrq-trigger
$D#44+

Example 2 cold start test:
echo g > /proc/sysrq-trigger
$3#33

The second one should re-enter kdb's shell right away and is purely a
test.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

37f86b46
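
Both example inputs above are well-formed gdb remote protocol packets: a '$', a payload, a '#', and two hex digits holding the sum of the payload bytes modulo 256 ('D' is 0x44, '3' is 0x33). The sketch below checks that framing in user space. The function names are hypothetical, and kdb's actual detection logic may differ from this check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sum of the payload bytes modulo 256, as in the gdb remote protocol. */
static unsigned char gdb_checksum(const char *payload, size_t len)
{
    unsigned char sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (unsigned char)(sum + (unsigned char)payload[i]);
    return sum;
}

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/* True if buf starts with a well-formed "$<payload>#<xx>" packet. */
static bool looks_like_gdb_packet(const char *buf)
{
    if (buf[0] != '$')
        return false;
    const char *hash = strchr(buf + 1, '#');
    if (hash == NULL)
        return false;
    int hi = hex_value(hash[1]);
    /* Only read the second digit if the first one was present. */
    int lo = (hi < 0) ? -1 : hex_value(hash[2]);
    if (lo < 0)
        return false;
    size_t len = (size_t)(hash - (buf + 1));
    return gdb_checksum(buf + 1, len) == (unsigned char)(hi * 16 + lo);
}
```

Checking the framing rather than matching a fixed list of known first packets is what lets the detection survive changes between gdb releases.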

kdb: Remove all references to DOING_KGDB2 · d613d828

Committed by Jason Wessel on May 23, 2011

The DOING_KGDB2 was originally a state variable for one of the two
ways to automatically transition from kdb to kgdb. Purge all these
variables and just use one single state for the transition.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

d613d828

kdb,kgdb: Implement switch and pass buffer from kdb -> gdb · f679c498

Committed by Jason Wessel on May 23, 2011

When switching from kdb mode to kgdb mode, packets were getting lost
depending on the size of the FIFO queue of the serial chip.  If the
kernel is in kdb mode when gdb initially connects, kdb should hand the
entire character buffer over to the gdbstub when switching connections.

Previously kdb was zero'ing out the character buffer and this could
lead to gdb failing to connect at all, or a lengthy pause could occur
on the initial connect.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

f679c498

kdb: cleanup unused variables missed in the original kdb merge · 3bdb65ec

Committed by Jason Wessel on Jun 30, 2011

The BTARGS and BTSYMARG variables do not have any function in the
mainline version of kdb.
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>

3bdb65ec

31 Jul, 2011 1 commit

resources: Add lookup_resource() · 1c388919

Committed by Geert Uytterhoeven on May 07, 2011

Add a function to find an existing resource by a resource start address.
This allows simple allocators (with a malloc/free-alike API) to be
implemented on top of the resource system.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>

1c388919
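
As a rough illustration of looking a resource up by its start address, the sketch below walks the direct children of a root node. The struct is a cut-down user-space stand-in, lookup_by_start() is a hypothetical name, and the locking a real resource tree needs is omitted.

```c
#include <stddef.h>

typedef unsigned long resource_size_t;   /* simplified for user space */

/* Cut-down stand-in for a resource tree node (assumption). */
struct resource {
    resource_size_t start;
    resource_size_t end;
    struct resource *sibling;
    struct resource *child;
};

/*
 * Return the direct child of root that begins exactly at start, or
 * NULL if there is none. An address that merely falls inside a child
 * does not match.
 */
static struct resource *lookup_by_start(struct resource *root,
                                        resource_size_t start)
{
    for (struct resource *res = root->child; res != NULL; res = res->sibling) {
        if (res->start == start)
            return res;
    }
    return NULL;
}
```

A free()-style call in an allocator built this way only needs the start address it handed out earlier in order to find the region to release.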

28 Jul, 2011 4 commits

dt/irq: add irq_domain_generate_simple() helper · 7e713301

Committed by Grant Likely on Jul 26, 2011

irq_domain_generate_simple() is an easy way to generate an irq translation
domain for simple irq controllers.  It assumes a flat 1:1 mapping from
hardware irq number to an offset of the first linux irq number assigned
to the controller.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

7e713301
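
The flat 1:1 mapping amounts to adding the hardware irq number to the first Linux irq number assigned to the controller. A small sketch of that arithmetic follows; the struct, the nr_irq bound and the name simple_to_irq() are hypothetical and not the helper's real interface.

```c
/* Hypothetical description of a simple controller's irq range. */
struct simple_irq_domain {
    unsigned int irq_base;   /* first Linux irq assigned to the controller */
    unsigned int nr_irq;     /* number of hardware irqs it provides */
};

/*
 * Translate a hardware irq number to a Linux irq number using the
 * flat mapping irq_base + hwirq. Returns -1 if hwirq is out of range.
 */
static int simple_to_irq(const struct simple_irq_domain *d, unsigned int hwirq)
{
    if (hwirq >= d->nr_irq)
        return -1;
    return (int)(d->irq_base + hwirq);
}
```

For a controller with irq_base 32 and 16 lines, hardware irq 5 translates to Linux irq 37.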

irq: add irq_domain translation infrastructure · 08a543ad

Committed by Grant Likely on Jul 26, 2011

This patch adds irq_domain infrastructure for translating from
hardware irq numbers to linux irqs.  This is particularly important
for architectures adding device tree support because the current
implementation (excluding PowerPC and SPARC) cannot handle
translation for more than a single interrupt controller.  irq_domain
supports device tree translation for any number of interrupt
controllers.

This patch converts x86, Microblaze, ARM and MIPS to use irq_domain
for device tree irq translation.  x86 is untested beyond compiling it.
irq_domain is enabled for MIPS and Microblaze, but the old behaviour is
preserved until the core code is modified to actually register an
irq_domain.  On ARM it works and is required for much of the new
ARM device tree board support.

PowerPC has /not/ been converted to use this new infrastructure.  It
is still missing some features before it can replace the virq
infrastructure already in powerpc (see documentation on
irq_domain_map/unmap for details).  Followup patches will add the
missing pieces and migrate PowerPC to use irq_domain.

SPARC has its own method of managing interrupts from the device tree
and is unaffected by this change.
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

08a543ad

[media] v4l2-compat-ioctl32: add VIDIOC_DQEVENT support · 2330fb82

Committed by Hans Verkuil on Jun 07, 2011

Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>

2330fb82

signals: sys_ssetmask/sys_rt_sigsuspend should use set_current_blocked() · c1095c6d

Committed by Oleg Nesterov on Jul 27, 2011

sys_ssetmask(), sys_rt_sigsuspend() and compat_sys_rt_sigsuspend()
change ->blocked directly.  This is not correct, see the changelog in
e6fa16ab "signal: sigprocmask() should do retarget_shared_pending()"

Change them to use set_current_blocked().

Another change is that now we are doing ->saved_sigmask = ->blocked
lockless, it doesn't make any sense to do this under ->siglock.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Matt Fleming <matt.fleming@linux.intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c1095c6d

27 Jul, 2011 6 commits

atomic: use <linux/atomic.h> · 60063497

Committed by Arun Sharma on Jul 26, 2011

This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

60063497
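
For readers unfamiliar with atomic_inc_not_zero(), the operation can be written once in an architecture-independent way as a compare-and-swap loop, which is why it is a natural candidate for a shared header. The sketch below uses C11 atomics in user space. It illustrates the semantics only and is not the kernel's implementation; inc_not_zero() is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. Returns true if the increment
 * happened, false if the value was zero and was left untouched.
 */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}
```

The typical use is taking a reference on an object only if its reference count has not already dropped to zero.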

panic: panic=-1 for immediate reboot · 4302fbc8

Committed by Hugh Dickins on Jul 26, 2011

When a kernel BUG or oops occurs, ChromeOS intends to panic and
immediately reboot, with stacktrace and other messages preserved in RAM
across reboot.

But the longer we delay, the more likely the user is to poweroff and
lose the info.

panic_timeout (seconds before rebooting) is set by panic= boot option or
sysctl or /proc/sys/kernel/panic; but 0 means wait forever, so at
present we have to delay at least 1 second.

Let a negative number mean reboot immediately (with the small cosmetic
benefit of suppressing that newline-less "Rebooting in %d seconds.."
message).
Signed-off-by: Hugh Dickins <hughd@chromium.org>
Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4302fbc8
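
The three cases for panic_timeout described in this commit (positive, zero, negative) can be summarized in a small decision function. This is a sketch of the semantics only; the enum and the name panic_action_for() are hypothetical.

```c
/* Hypothetical names for the three behaviours described above. */
enum panic_action {
    PANIC_WAIT_FOREVER,        /* panic_timeout == 0 */
    PANIC_REBOOT_AFTER_DELAY,  /* panic_timeout > 0: wait that many seconds */
    PANIC_REBOOT_NOW           /* panic_timeout < 0: reboot immediately */
};

/* Map a panic_timeout value to the behaviour it selects. */
static enum panic_action panic_action_for(int panic_timeout)
{
    if (panic_timeout > 0)
        return PANIC_REBOOT_AFTER_DELAY;
    if (panic_timeout < 0)
        return PANIC_REBOOT_NOW;
    return PANIC_WAIT_FOREVER;
}
```

With this change, booting with panic=-1 selects the immediate-reboot case, which previously could only be approximated with a delay of at least one second.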

gcov: disable CONSTRUCTORS for UML · 947be5df

Committed by Vitaliy Ivanov on Jul 26, 2011

Selecting GCOV for UML causes a configuration mismatch:

  warning: (GCOV_KERNEL) selects CONSTRUCTORS which has unmet direct dependencies (!UML)

Constructors are not needed for UML.
Signed-off-by: Vitaliy Ivanov <vitalivanov@gmail.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: Richard Weinberger <richard@nod.at>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

947be5df

ipc: introduce shm_rmid_forced sysctl · b34a6b1d

Committed by Vasiliy Kulikov on Jul 26, 2011

Add support for the shm_rmid_forced sysctl.  If set to 1, all shared
memory objects in current ipc namespace will be automatically forced to
use IPC_RMID.

The POSIX way of handling shmem allows one to create shm objects and
call shmdt(), leaving shm object associated with no process, thus
consuming memory not counted via rlimits.

With shm_rmid_forced=1 the shared memory object is counted at least for
one process, so OOM killer may effectively kill the fat process holding
the shared memory.

It obviously breaks POSIX - some programs relying on the feature would
stop working.  So set shm_rmid_forced=1 only if you're sure nobody uses
"orphaned" memory.  Use shm_rmid_forced=0 by default for compatibility
reasons.

The feature was previously implemented in -ow as a configure option.

[akpm@linux-foundation.org: fix documentation, per Randy]
[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: readability/conventionality tweaks]
[akpm@linux-foundation.org: fix shm_rmid_forced/shm_forced_rmid confusion, use standard comment layout]
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge.hallyn@canonical.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Solar Designer <solar@openwall.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

b34a6b1d
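
The difference between the two settings can be shown with a toy model of a segment's lifetime: without the sysctl, a segment that nobody is attached to survives unless IPC_RMID was requested; with shm_rmid_forced=1 every segment behaves as if it had been. The struct and model_detach() below are hypothetical and deliberately ignore everything else the real code has to handle.

```c
#include <stdbool.h>

/* Toy model of a shared memory segment (assumption, not kernel code). */
struct shm_segment {
    int nattch;            /* number of current attachments */
    bool rmid_requested;   /* IPC_RMID was issued for this segment */
    bool destroyed;
};

/*
 * Model one detach. The segment is destroyed when the last attachment
 * goes away and removal was either requested explicitly or is forced
 * for all segments by the sysctl.
 */
static void model_detach(struct shm_segment *seg, bool shm_rmid_forced)
{
    if (seg->nattch > 0)
        seg->nattch--;
    if (seg->nattch == 0 && (seg->rmid_requested || shm_rmid_forced))
        seg->destroyed = true;
}
```

In the model, the "orphaned" segment from the commit message is the one left with nattch == 0 and destroyed == false, which can only happen when shm_rmid_forced is off.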

kernel/fork.c: fix a few coding style issues · fb0a685c

Committed by Daniel Rebelo de Oliveira on Jul 26, 2011

Signed-off-by: Daniel Rebelo de Oliveira <psykon@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fb0a685c

cpusets: randomize node rotor used in cpuset_mem_spread_node() · 778d3b0f

Committed by Michal Hocko on Jul 26, 2011

[ This patch has already been accepted as commit 0ac0c0d0 but later
  reverted (commit 35926ff5) because it introduced arch specific
  __node_random which was defined only for x86 code so it broke other
  archs.  This is a followup without any arch specific code.  Other than
  that there are no functional changes.]

Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems).  Part of the reason is
that the rotor (in cpuset_mem_spread_node()) used to assign nodes starts
at node 0 for newly created tasks.

This patch changes the rotor to be initialized to a random node number
of the cpuset.

[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
[mhocko@suse.cz: Make it arch independent]
[akpm@linux-foundation.org: fix CONFIG_NUMA=y, MAX_NUMNODES>1 build]
Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Paul Menage <menage@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Robin Holt <holt@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Paul Menage <menage@google.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

778d3b0f
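
Initializing the rotor to a random node of the cpuset comes down to picking a random set bit from a node mask. A user-space sketch follows, assuming the mask fits in one unsigned long and using rand() as a stand-in random source; pick_random_node() is a hypothetical name, and the modulo introduces a slight bias that does not matter for spreading.

```c
#include <stdlib.h>

#define MAX_NODES ((int)(sizeof(unsigned long) * 8))

/*
 * Return the index of a randomly chosen set bit in mask (one bit per
 * allowed node), or -1 if the mask is empty.
 */
static int pick_random_node(unsigned long mask)
{
    int weight = 0;

    /* Count the allowed nodes. */
    for (int n = 0; n < MAX_NODES; n++) {
        if (mask & (1UL << n))
            weight++;
    }
    if (weight == 0)
        return -1;

    /* Walk to the target-th allowed node. */
    int target = rand() % weight;
    for (int n = 0; n < MAX_NODES; n++) {
        if (!(mask & (1UL << n)))
            continue;
        if (target-- == 0)
            return n;
    }
    return -1;   /* not reached for a non-empty mask */
}
```

Starting each new task's rotor at such a node, rather than always at node 0, is what keeps many short-lived tasks from piling their first pages onto the same node.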
