提交 · 831830b5a2b5d413407adf380ef62fe17d6fcbf2 · OpenHarmony / kernel_linux

03 1月, 2008 1 次提交

restrict reading from /proc/<pid>/maps to those who share ->mm or can ptrace pid · 831830b5

由 Al Viro 提交于 1月 02, 2008

Contents of /proc/*/maps is sensitive and may become sensitive after
open() (e.g.  if target originally shares our ->mm and later does exec
on suid-root binary).

Check at read() (actually, ->start() of iterator) time that mm_struct
we'd grabbed and locked is
 - still the ->mm of target
 - equal to reader's ->mm or the target is ptracable by reader.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

831830b5

29 12月, 2007 1 次提交

[SERIAL]: Fix section mismatches in Sun serial console drivers. · fb445ee5

由 David S. Miller 提交于 12月 29, 2007

We're exporting an __init function, oops :-)

The core issue here is that add_preferred_console() is marked
as __init, this makes it impossible to invoke this thing from
a driver probe routine which is what the Sparc serial drivers
need to do.

There is no harm in dropping the __init marker.  This code will
actually work properly when invoked from a modular driver,
except that init will probably not pick up the console change
without some other support code.

Then we can drop the __init from sunserial_console_match()
and we're no longer exporting an __init function to modules.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

fb445ee5

23 12月, 2007 1 次提交

Modules: fix memory leak of module names · d172f4ef

由 Greg Kroah-Hartman 提交于 12月 22, 2007

Due to the change in kobject name handling, the module kobject needs to
have a null release function to ensure that the name it previously set
will be properly cleaned up.

All of this wierdness goes away in 2.6.25 with the rework of the kobject
name and cleanup logic, but this is required for 2.6.24.

Thanks to Alexey Dobriyan for finding the problem, and to Kay Sievers
for pointing out the simple way to fix it after I tried many complex
ways.

Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

d172f4ef

20 12月, 2007 2 次提交

debug: add end-of-oops marker · 2c3b20e9

由 Arjan van de Ven 提交于 12月 20, 2007

Right now it's nearly impossible for parsers that collect kernel crashes
from logs or emails (such as www.kerneloops.org) to detect the
end-of-oops condition. In addition, it's not currently possible to
detect whether or not 2 oopses that look alike are actually the same
oops reported twice, or are truly two unique oopses.

This patch adds an end-of-oops marker, and makes the end marker include
a very simple 64-bit random ID to be able to detect duplicate reports.

Normally, this ID is calculated as a late_initcall() (in the hope that
at that time there is enough entropy to get a unique enough ID); however
for early oopses the oops_exit() function needs to generate the ID on
the fly.

We do this all at the _end_ of an oops printout, so this does not impact
our ability to get the most important portions of a crash out to the
console first.

[ Sidenote: the already existing oopses-since-bootup counter we print
  during crashes serves as the differentiator between multiple oopses
  that trigger during the same bootup. ]

Tested on 32-bit and 64-bit x86. Artificially injected very early
crashes as well, as expected they result in this constant ID after
multiple bootups:

  ---[ end trace ca143223eefdc828 ]---
  ---[ end trace ca143223eefdc828 ]---

because the random pools are still all zero. But it all still works
fine and causes no additional problems (which is the main goal of
instrumentation code).
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

2c3b20e9

sched: rt: account the cpu time during the tick · 67e2be02

由 Peter Zijlstra 提交于 12月 20, 2007

Realtime tasks would not account their runtime during ticks. Which would lead
to:

        struct sched_param param = { .sched_priority = 10 };
        pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);

	while (1) ;

Not showing up in top.
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

67e2be02

19 12月, 2007 3 次提交

genirq: revert lazy irq disable for simple irqs · 971e5b35

由 Steven Rostedt 提交于 12月 18, 2007

In commit 76d21601 lazy irq disabling
was implemented, and the simple irq handler had a masking set to it.

Remy Bohmer discovered that some devices in the ARM architecture
would trigger the mask, but never unmask it. His patch to do the
unmasking was questioned by Russell King about masking simple irqs
to begin with. Looking further, it was discovered that the problems
Remy was seeing was due to improper use of the simple handler by
devices, and he later submitted patches to fix those. But the issue
that was uncovered was that the simple handler should never mask.

This patch reverts the masking in the simple handler.
Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>

971e5b35

timer: kernel/timer.c section fixes · b4be6258

由 Adrian Bunk 提交于 12月 18, 2007

This patch fixes the following section mismatches with CONFIG_HOTPLUG=n,
CONFIG_HOTPLUG_CPU=y:

...
WARNING: vmlinux.o(.text+0x41cd3): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq')
WARNING: vmlinux.o(.text+0x41d67): Section mismatch: reference to .init.data:tvec_base_done.22610 (between 'timer_cpu_notify' and 'run_timer_softirq')
...
Signed-off-by: NAdrian Bunk <bunk@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

b4be6258

clockevents: fix reprogramming decision in oneshot broadcast · cdc6f27d

由 Thomas Gleixner 提交于 12月 18, 2007

Resolve the following regression of a choppy, almost unusable laptop:

http://lkml.org/lkml/2007/12/7/299
http://bugzilla.kernel.org/show_bug.cgi?id=9525

A previous version of the code did the reprogramming of the broadcast
device in the return from idle code. This was removed, but the logic in
tick_handle_oneshot_broadcast() was kept the same.

When a broadcast interrupt happens we signal the expiry to all CPUs
which have an expired event. If none of the CPUs has an expired event,
which can happen in dyntick mode, then we reprogram the broadcast
device. We do not reprogram otherwise, but this is only correct if all
CPUs, which are in the idle broadcast state have been woken up.

The code ignores, that there might be pending not yet expired events on
other CPUs, which are in the idle broadcast state. So the delivery of
those events can be delayed for quite a time.

Change the tick_handle_oneshot_broadcast() function to check for CPUs,
which are in broadcast state and are not woken up by the current event,
and enforce the rearming of the broadcast device for those CPUs.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cdc6f27d

18 12月, 2007 8 次提交

sched: do not hurt SCHED_BATCH on wakeup · 6cbf1c12

由 Ingo Molnar 提交于 12月 18, 2007

measurements by Yanmin Zhang have shown that SCHED_BATCH tasks benefit
if they run the same place_entity() logic as SCHED_OTHER tasks - so
uniformize behavior in this area.
Signed-off-by: NIngo Molnar <mingo@elte.hu>

6cbf1c12

I
sched: touch softlockup watchdog after idling · 2bacec8c
由 Ingo Molnar 提交于 12月 18, 2007
```
touch softlockup watchdog after idling.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
```
2bacec8c

sched: sysctl, proc_dointvec_minmax() expects int values for · 73c4efd2

由 Eric Dumazet 提交于 12月 18, 2007

min_sched_granularity_ns, max_sched_granularity_ns,
min_wakeup_granularity_ns and max_wakeup_granularity_ns are declared
"unsigned long".

This is incorrect since proc_dointvec_minmax() expects plain "int" guard
values.

This bug only triggers on big endian 64 bit arches.
Signed-off-by: NEric Dumazet <dada1@cosmosbay.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

73c4efd2

sched: mark rwsem functions as __sched for wchan/profiling · c7af77b5

由 Livio Soares 提交于 12月 18, 2007

This following commit

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fdf8cb0909b531f9ae8f9b9d7e4eb35ba3505f07

un-inlined a low-level rwsem function, but did not mark it as __sched.
The result is that it now shows up as thread wchan (which also affects
/proc/profile stats).  The following simple patch fixes this by properly
marking rwsem_down_failed_common() as a __sched function.

Also in this patch, which is up for discussion, marks down_read() and
down_write() proper as __sched.  For profiling, it is pretty much
useless to know that a semaphore is beig help - it is necessary to know
_which_ one.  By going up another frame on the stack, the information
becomes much more useful.

In summary, the below change to lib/rwsem.c should be applied; the
changes to kernel/rwsem.c could be applied if other kernel hackers agree
with my proposal that down_read()/down_write() in the profile is not
enough.

[ akpm@linux-foundation.org: build fix ]
Signed-off-by: NLivio Soares <livio@eecg.toronto.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

c7af77b5

sched: fix crash on ia64, introduce task_current() · 051a1d1a

由 Dmitry Adamushko 提交于 12月 18, 2007

Some services (e.g. sched_setscheduler(), rt_mutex_setprio() and
sched_move_task()) must handle a given task differently in case it's the
'rq->curr' task on its run-queue. The task_running() interface is not
suitable for determining such tasks for platforms with one of the
following options:

#define __ARCH_WANT_UNLOCKED_CTXSW
#define __ARCH_WANT_INTERRUPTS_ON_CTXSW

Due to the fact that it makes use of 'p->oncpu == 1' as a criterion but
such a task is not necessarily 'rq->curr'.

The detailed explanation is available here:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009262.htmlSigned-off-by: NDmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Tested-by: NDhaval Giani <dhaval@linux.vnet.ibm.com>
Tested-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

051a1d1a

sysctl: fix ax25 checks · 64396acc

由 Eric W. Biederman 提交于 12月 17, 2007

Fix:

sysctl table check failed: /net/ax25/ax0/ax25_default_mode .3.9.1.2 Unknown
sysctl binary path
Pid: 2936, comm: kissattach Not tainted 2.6.24-rc5 #1
 [<c012ca6a>] set_fail+0x3b/0x43
 [<c012ce7a>] sysctl_check_table+0x408/0x456
 [<c012ce8e>] sysctl_check_table+0x41c/0x456
 [<c012ce8e>] sysctl_check_table+0x41c/0x456
 ...
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Cc: Bernard Pidoux <pidoux@ccr.jussieu.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

64396acc

Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63

由 Nishanth Aravamudan 提交于 12月 17, 2007

This reverts commit 54f9f80d ("hugetlb:
Add hugetlb_dynamic_pool sysctl")

Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).

(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
Acked-by: NAdam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

368d2c63

hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f

由 Nishanth Aravamudan 提交于 12月 17, 2007

hugetlb: introduce nr_overcommit_hugepages sysctl

While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition

	nr_overcommit_hugepages > 0

indicates the same administrative setting as

	hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.

A few caveats:

1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
Acked-by: NAdam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d1c3fb1f

08 12月, 2007 4 次提交

clockevents: warn once when program_event() is called with negative expiry · 167b1de3

由 Thomas Gleixner 提交于 12月 07, 2007

The hrtimer problem with large relative timeouts resulting in a
negative expiry time went unnoticed as there is no check in the
clockevents_program_event() code. Put a check there with a WARN_ONCE
to avoid such problems in the future.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

167b1de3

hrtimers: avoid overflow for large relative timeouts · 62f0f61e

由 Thomas Gleixner 提交于 12月 07, 2007

Relative hrtimers with a large timeout value might end up as negative
timer values, when the current time is added in hrtimer_start().

This in turn is causing the clockevents_set_next() function to set an
huge timeout and sleep for quite a long time when we have a clock
source which is capable of long sleeps like HPET. With PIT this almost
goes unnoticed as the maximum delta is ~27ms. The non-hrt/nohz code
sorts this out in the next timer interrupt, so we never noticed that
problem which has been there since the first day of hrtimers.

This bug became more apparent in 2.6.24 which activates HPET on more
hardware.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

62f0f61e

sched: enable early use of sched_clock() · 8ced5f69

由 Ingo Molnar 提交于 12月 07, 2007

some platforms have sched_clock() implementations that cannot be called
very early during wakeup. If it's called it might hang or crash in hard
to debug ways. So only call update_rq_clock() [which calls sched_clock()]
if sched_init() has already been called. (rq->idle is NULL before the
scheduler is initialized.)
Signed-off-by: NIngo Molnar <mingo@elte.hu>

8ced5f69

lockdep: make cli/sti annotation warnings clearer · 5f9fa8a6

由 Ingo Molnar 提交于 12月 07, 2007

make cli/sti annotation warnings easier to interpret.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

5f9fa8a6

07 12月, 2007 2 次提交

Tiny clean-up of OPROFILE/KPROBES configuration · 3743d33e

由 Linus Torvalds 提交于 12月 06, 2007

Make the Kconfig.instrumentation file a bit easier on the eyes, and use
the new ARCH_SUPPORTS_OPROFILE for x86[-64].
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3743d33e

Fix oprofile configuration breakage · 00a58253

由 Ralf Baechle 提交于 12月 06, 2007

The cleanup 09cadedb broke the oprofile
configuration for MIPS by allowing oprofile support to be built for
kernel models where oprofile doesn't have a chance in hell to work.

Just a dependecy list on a number of architectures is - surprise - broken
and should as per past discussions probably in most considered to be
broken in most cases.  So I introduce a dependency for the oprofile
configuration on ARCH_SUPPORTS_OPROFILE.
Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

00a58253

06 12月, 2007 2 次提交

Avoid potential NULL dereference in unregister_sysctl_table · f1dad166

由 Pavel Emelyanov 提交于 12月 04, 2007

register_sysctl_table() can return NULL sometimes, e.g.  when kmalloc()
returns NULL or when sysctl check fails.

I've also noticed, that many (most?) code in the kernel doesn't check for
the return value from register_sysctl_table() and later simply calls the
unregister_sysctl_table() with potentially NULL argument.

This is unlikely on a common kernel configuration, but in case we're
dealing with modules and/or fault-injection support, there's a slight
possibility of an OOPS.

Changing all the users to check for return code from the registering does
not look like a good solution - there are too many code doing this and
failure in sysctl tables registration is not a good reason to abort module
loading (in most of the cases).

So I think, that we can just have this check in unregister_sysctl_table
just to avoid accidental OOPS-es (actually, the unregister_sysctl_table()
did exactly this, before the start_unregistering() appeared).
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f1dad166

fix clone(CLONE_NEWPID) · 5cd17569

由 Eric W. Biederman 提交于 12月 04, 2007

Currently we are complicating the code in copy_process, the clone ABI, and
if we fix the bugs sys_setsid itself, with an unnecessary open coded
version of sys_setsid.

So just simplify everything and don't special case the session and pgrp of
the initial process in a pid namespace.

Having this special case actually presents to user space the classic linux
startup conditions with session == pgrp == 0 for /sbin/init.

We already handle sending signals to processes in a child pid namespace.

We need to handle sending signals to processes in a parent pid namespace
for cases like SIGCHILD and SIGIO.

This makes nothing extra visible inside a pid namespace.  So this extra
special case appears to have no redeeming merits.

Further removing this special case increases the flexibility of how we can
use pid namespaces, by not requiring the initial process in a pid namespace
to be a daemon.
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5cd17569

05 12月, 2007 8 次提交

futex: correctly return -EFAULT not -EINVAL · cde898fa

由 Thomas Gleixner 提交于 12月 05, 2007

return -EFAULT not -EINVAL. Found by review.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cde898fa

lockdep: in_range() fix · 54561783

由 Oleg Nesterov 提交于 12月 05, 2007

Torsten Kaiser wrote:

| static inline int in_range(const void *start, const void *addr, const void *end)
| {
|         return addr >= start && addr <= end;
| }
| This  will return true, if addr is in the range of start (including)
| to end (including).
|
| But debug_check_no_locks_freed() seems does:
| const void *mem_to = mem_from + mem_len
| -> mem_to is the last byte of the freed range, that fits in_range
| lock_from = (void *)hlock->instance;
| -> first byte of the lock
| lock_to = (void *)(hlock->instance + 1);
| -> first byte of the next lock, not last byte of the lock that is being checked!
|
| The test is:
| if (!in_range(mem_from, lock_from, mem_to) &&
|                                         !in_range(mem_from, lock_to, mem_to))
|                         continue;
| So it tests, if the first byte of the lock is in the range that is freed ->OK
| And if the first byte of the *next* lock is in the range that is freed
| -> Not OK.

We can also simplify in_range checks, we need only 2 comparisons, not 4.
If the lock is not in memory range, it should be either at the left of range
or at the right.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

54561783

lockdep: fix debug_show_all_locks() · 85684873

由 Ingo Molnar 提交于 12月 05, 2007

fix the oops that can be seen in:

   http://bugzilla.kernel.org/attachment.cgi?id=13828&action=view

it is not safe to print the locks of running tasks.

(even with this fix we have a small race - but this is a debug
 function after all.)
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>

85684873

sched: style cleanups · 41a2d6cf

由 Ingo Molnar 提交于 12月 05, 2007

style cleanup of various changes that were done recently.

no code changed:

      text    data     bss     dec     hex filename
     23680    2542      28   26250    668a sched.o.before
     23680    2542      28   26250    668a sched.o.after
Signed-off-by: NIngo Molnar <mingo@elte.hu>

41a2d6cf

futex: fix for futex_wait signal stack corruption · ce6bd420

由 Steven Rostedt 提交于 12月 05, 2007

David Holmes found a bug in the -rt tree with respect to
pthread_cond_timedwait. After trying his test program on the latest git
from mainline, I found the bug was there too.  The bug he was seeing
that his test program showed, was that if one were to do a "Ctrl-Z" on a
process that was in the pthread_cond_timedwait, and then did a "bg" on
that process, it would return with a "-ETIMEDOUT" but early. That is,
the timer would go off early.

Looking into this, I found the source of the problem. And it is a rather
nasty bug at that.

Here's the relevant code from kernel/futex.c: (not in order in the file)

[...]
smlinkage long sys_futex(u32 __user *uaddr, int op, u32 val,
                          struct timespec __user *utime, u32 __user *uaddr2,
                          u32 val3)
{
        struct timespec ts;
        ktime_t t, *tp = NULL;
        u32 val2 = 0;
        int cmd = op & FUTEX_CMD_MASK;

        if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
                if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
                        return -EFAULT;
                if (!timespec_valid(&ts))
                        return -EINVAL;

                t = timespec_to_ktime(ts);
                if (cmd == FUTEX_WAIT)
                        t = ktime_add(ktime_get(), t);
                tp = &t;
        }
[...]
        return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);
}

[...]

long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
                u32 __user *uaddr2, u32 val2, u32 val3)
{
        int ret;
        int cmd = op & FUTEX_CMD_MASK;
        struct rw_semaphore *fshared = NULL;

        if (!(op & FUTEX_PRIVATE_FLAG))
                fshared = &current->mm->mmap_sem;

        switch (cmd) {
        case FUTEX_WAIT:
                ret = futex_wait(uaddr, fshared, val, timeout);

[...]

static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
                      u32 val, ktime_t *abs_time)
{
[...]
               struct restart_block *restart;
                restart = &current_thread_info()->restart_block;
                restart->fn = futex_wait_restart;
                restart->arg0 = (unsigned long)uaddr;
                restart->arg1 = (unsigned long)val;
                restart->arg2 = (unsigned long)abs_time;
                restart->arg3 = 0;
                if (fshared)
                        restart->arg3 |= ARG3_SHARED;
                return -ERESTART_RESTARTBLOCK;
[...]

static long futex_wait_restart(struct restart_block *restart)
{
        u32 __user *uaddr = (u32 __user *)restart->arg0;
        u32 val = (u32)restart->arg1;
        ktime_t *abs_time = (ktime_t *)restart->arg2;
        struct rw_semaphore *fshared = NULL;

        restart->fn = do_no_restart_syscall;
        if (restart->arg3 & ARG3_SHARED)
                fshared = &current->mm->mmap_sem;
        return (long)futex_wait(uaddr, fshared, val, abs_time);
}

So when the futex_wait is interrupt by a signal we break out of the
hrtimer code and set up or return from signal. This code does not return
back to userspace, so we set up a RESTARTBLOCK.  The bug here is that we
save the "abs_time" which is a pointer to the stack variable "ktime_t t"
from sys_futex.

This returns and unwinds the stack before we get to call our signal. On
return from the signal we go to futex_wait_restart, where we update all
the parameters for futex_wait and call it. But here we have a problem
where abs_time is no longer valid.

I verified this with print statements, and sure enough, what abs_time
was set to ends up being garbage when we get to futex_wait_restart.

The solution I did to solve this (with input from Linus Torvalds)
was to add unions to the restart_block to allow system calls to
use the restart with specific parameters.  This way the futex code now
saves the time in a 64bit value in the restart block instead of storing
it on the stack.

Note: I'm a bit nervious to add "linux/types.h" and use u32 and u64
in thread_info.h, when there's a #ifdef __KERNEL__ just below that.
Not sure what that is there for.  If this turns out to be a problem, I've
tested this with using "unsigned int" for u32 and "unsigned long long" for
u64 and it worked just the same. I'm using u32 and u64 just to be
consistent with what the futex code uses.
Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>

ce6bd420

D
[SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string. · 874a5f87
由 David S. Miller 提交于 11月 19, 2007
```
Based upon a report by Mikael Pettersson.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
```
874a5f87

sched: default to more agressive yield for SCHED_BATCH tasks · db292ca3

由 Ingo Molnar 提交于 12月 04, 2007

do more agressive yield for SCHED_BATCH tuned tasks: they are all
about throughput anyway. This allows a gentler migration path for
any apps that relied on stronger yield.
Signed-off-by: NIngo Molnar <mingo@elte.hu>

db292ca3

sched: fix crash in sys_sched_rr_get_interval() · 77034937

由 Ingo Molnar 提交于 12月 04, 2007

Luiz Fernando N. Capitulino reported that sched_rr_get_interval()
crashes for SCHED_OTHER tasks that are on an idle runqueue.

The fix is to return a 0 timeslice for tasks that are on an idle
runqueue. (and which are not running, obviously)

this also shrinks the code a bit:

   text    data     bss     dec     hex filename
  47903    3934     336   52173    cbcd sched.o.before
  47885    3934     336   52155    cbbb sched.o.after
Reported-by: NLuiz Fernando N. Capitulino <lcapitulino@mandriva.com.br>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

77034937

04 12月, 2007 1 次提交

uml: add !UML dependencies · b00296fb

由 Al Viro 提交于 12月 01, 2007

The previous commit ("uml: keep UML Kconfig in sync with x86") is not
enough, unfortunately.  If we go that way, we need to add dependencies
on !UML for several options.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeff Dike <jdike@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b00296fb

03 12月, 2007 1 次提交

sched: cpu accounting controller (V2) · d842de87

由 Srivatsa Vaddagiri 提交于 12月 02, 2007

Commit cfb52856 removed a useful feature for
us, which provided a cpu accounting resource controller.  This feature would be
useful if someone wants to group tasks only for accounting purpose and doesnt
really want to exercise any control over their cpu consumption.

The patch below reintroduces the feature. It is based on Paul Menage's
original patch (Commit 62d0df64), with
these differences:

        - Removed load average information. I felt it needs more thought (esp
	  to deal with SMP and virtualized platforms) and can be added for
	  2.6.25 after more discussions.
        - Convert group cpu usage to be nanosecond accurate (as rest of the cfs
	  stats are) and invoke cpuacct_charge() from the respective scheduler
	  classes
	- Make accounting scalable on SMP systems by splitting the usage
	  counter to be per-cpu
	- Move the code from kernel/cpu_acct.c to kernel/sched.c (since the
	  code is not big enough to warrant a new file and also this rightly
	  needs to live inside the scheduler. Also things like accessing
	  rq->lock while reading cpu usage becomes easier if the code lived in
	  kernel/sched.c)

The patch also modifies the cpu controller not to provide the same accounting
information.
Tested-by: NBalbir Singh <balbir@linux.vnet.ibm.com>

 Tested the patches on top of 2.6.24-rc3. The patches work fine. Ran
 some simple tests like cpuspin (spin on the cpu), ran several tasks in
 the same group and timed them. Compared their time stamps with
 cpuacct.usage.
Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

d842de87

30 11月, 2007 4 次提交

wait_task_stopped(): pass correct exit_code to wait_noreap_copyout() · e6ceb32a

由 Scott James Remnant 提交于 11月 28, 2007

In wait_task_stopped() exit_code already contains the right value for the
si_status member of siginfo, and this is simply set in the non WNOWAIT
case.

If you call waitid() with a stopped or traced process, you'll get the signal
in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
same time, you'll get the signal << 8 | 0x7f

Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
and add 0x7f if we were returning it in the user status field and that
isn't used for any function that permits WNOWAIT.
Signed-off-by: NScott James Remnant <scott@ubuntu.com>
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e6ceb32a

FRV: fix the extern declaration of kallsyms_num_syms · 9e6c1e63

由 David Howells 提交于 11月 28, 2007

Fix the extern declaration of kallsyms_num_syms to indicate that the symbol
does not reside in the small-data storage space, and so may not be accessed
relative to the small data base register.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9e6c1e63

Isolate the UTS namespace's domainname and hostname back · 32df81cb

由 Pavel Emelyanov 提交于 11月 28, 2007

Commit 7d69a1f4 ("remove CONFIG_UTS_NS
and CONFIG_IPC_NS") by Cedric Le Goater accidentally removed the code
that prevented the uts->hostname and uts->domainname values from being
overwritten from another namespace.

In other words, setting hostname/domainname via sysfs (echo xxx >
/proc/sys/kernel/(host|domain)name) cased the new value to be set in
init UTS namespace only.

Return the isolation back.
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Acked-by: NCedric Le Goater <clg@fr.ibm.com>
Acked-by: NSerge Hallyn <serue@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

32df81cb

wait_task_stopped(): don't use task_pid_nr_ns() lockless · c8950783

由 Oleg Nesterov 提交于 11月 28, 2007

wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
we can read an already freed memory.  Use the cached pid_t value.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Looks-good-to: Roland McGrath <roland@redhat.com>
Acked-by: NPavel Emelyanov <xemul@openvz.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c8950783

28 11月, 2007 2 次提交
- I
  sched: clean up kernel/sched_stat.h · f95e0d1c
  由 Ingo Molnar 提交于 11月 28, 2007
```
clean up kernel/sched_stat.h.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
```
  f95e0d1c
- I
  sched: clean up overlong line in kernel/sched_debug.c · c1a89740
  由 Ingo Molnar 提交于 11月 28, 2007
```
clean up overlong line in kernel/sched_debug.c.
Signed-off-by: NIngo Molnar <mingo@elte.hu>
```
  c1a89740

OpenHarmony / kernel_linux 上一次同步 大约 4 年

OpenHarmony / kernel_linux
上一次同步大约 4 年