提交 · d84f4f992cbd76e8f39c488cf0c5d123843923b1 · openeuler / raspberrypi-kernel

14 11月, 2008 1 次提交

CRED: Inaugurate COW credentials · d84f4f99

由 David Howells 提交于 11月 14, 2008

Inaugurate copy-on-write credentials management.  This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.

A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().

With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:

	struct cred *new = prepare_creds();
	int ret = blah(new);
	if (ret < 0) {
		abort_creds(new);
		return ret;
	}
	return commit_creds(new);

There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.

To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const.  The purpose of this is compile-time
discouragement of altering credentials through those pointers.  Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:

  (1) Its reference count may incremented and decremented.

  (2) The keyrings to which it points may be modified, but not replaced.

The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).

This patch and the preceding patches have been tested with the LTP SELinux
testsuite.

This patch makes several logical sets of alteration:

 (1) execve().

     This now prepares and commits credentials in various places in the
     security code rather than altering the current creds directly.

 (2) Temporary credential overrides.

     do_coredump() and sys_faccessat() now prepare their own credentials and
     temporarily override the ones currently on the acting thread, whilst
     preventing interference from other threads by holding cred_replace_mutex
     on the thread being dumped.

     This will be replaced in a future patch by something that hands down the
     credentials directly to the functions being called, rather than altering
     the task's objective credentials.

 (3) LSM interface.

     A number of functions have been changed, added or removed:

     (*) security_capset_check(), ->capset_check()
     (*) security_capset_set(), ->capset_set()

     	 Removed in favour of security_capset().

     (*) security_capset(), ->capset()

     	 New.  This is passed a pointer to the new creds, a pointer to the old
     	 creds and the proposed capability sets.  It should fill in the new
     	 creds or return an error.  All pointers, barring the pointer to the
     	 new creds, are now const.

     (*) security_bprm_apply_creds(), ->bprm_apply_creds()

     	 Changed; now returns a value, which will cause the process to be
     	 killed if it's an error.

     (*) security_task_alloc(), ->task_alloc_security()

     	 Removed in favour of security_prepare_creds().

     (*) security_cred_free(), ->cred_free()

     	 New.  Free security data attached to cred->security.

     (*) security_prepare_creds(), ->cred_prepare()

     	 New. Duplicate any security data attached to cred->security.

     (*) security_commit_creds(), ->cred_commit()

     	 New. Apply any security effects for the upcoming installation of new
     	 security by commit_creds().

     (*) security_task_post_setuid(), ->task_post_setuid()

     	 Removed in favour of security_task_fix_setuid().

     (*) security_task_fix_setuid(), ->task_fix_setuid()

     	 Fix up the proposed new credentials for setuid().  This is used by
     	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
     	 setuid() changes.  Changes are made to the new credentials, rather
     	 than the task itself as in security_task_post_setuid().

     (*) security_task_reparent_to_init(), ->task_reparent_to_init()

     	 Removed.  Instead the task being reparented to init is referred
     	 directly to init's credentials.

	 NOTE!  This results in the loss of some state: SELinux's osid no
	 longer records the sid of the thread that forked it.

     (*) security_key_alloc(), ->key_alloc()
     (*) security_key_permission(), ->key_permission()

     	 Changed.  These now take cred pointers rather than task pointers to
     	 refer to the security context.

 (4) sys_capset().

     This has been simplified and uses less locking.  The LSM functions it
     calls have been merged.

 (5) reparent_to_kthreadd().

     This gives the current thread the same credentials as init by simply using
     commit_thread() to point that way.

 (6) __sigqueue_alloc() and switch_uid()

     __sigqueue_alloc() can't stop the target task from changing its creds
     beneath it, so this function gets a reference to the currently applicable
     user_struct which it then passes into the sigqueue struct it returns if
     successful.

     switch_uid() is now called from commit_creds(), and possibly should be
     folded into that.  commit_creds() should take care of protecting
     __sigqueue_alloc().

 (7) [sg]et[ug]id() and co and [sg]et_current_groups.

     The set functions now all use prepare_creds(), commit_creds() and
     abort_creds() to build and check a new set of credentials before applying
     it.

     security_task_set[ug]id() is called inside the prepared section.  This
     guarantees that nothing else will affect the creds until we've finished.

     The calling of set_dumpable() has been moved into commit_creds().

     Much of the functionality of set_user() has been moved into
     commit_creds().

     The get functions all simply access the data directly.

 (8) security_task_prctl() and cap_task_prctl().

     security_task_prctl() has been modified to return -ENOSYS if it doesn't
     want to handle a function, or otherwise return the return value directly
     rather than through an argument.

     Additionally, cap_task_prctl() now prepares a new set of credentials, even
     if it doesn't end up using it.

 (9) Keyrings.

     A number of changes have been made to the keyrings code:

     (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
     	 all been dropped and built in to the credentials functions directly.
     	 They may want separating out again later.

     (b) key_alloc() and search_process_keyrings() now take a cred pointer
     	 rather than a task pointer to specify the security context.

     (c) copy_creds() gives a new thread within the same thread group a new
     	 thread keyring if its parent had one, otherwise it discards the thread
     	 keyring.

     (d) The authorisation key now points directly to the credentials to extend
     	 the search into rather pointing to the task that carries them.

     (e) Installing thread, process or session keyrings causes a new set of
     	 credentials to be created, even though it's not strictly necessary for
     	 process or session keyrings (they're shared).

(10) Usermode helper.

     The usermode helper code now carries a cred struct pointer in its
     subprocess_info struct instead of a new session keyring pointer.  This set
     of credentials is derived from init_cred and installed on the new process
     after it has been cloned.

     call_usermodehelper_setup() allocates the new credentials and
     call_usermodehelper_freeinfo() discards them if they haven't been used.  A
     special cred function (prepare_usermodeinfo_creds()) is provided
     specifically for call_usermodehelper_setup() to call.

     call_usermodehelper_setkeys() adjusts the credentials to sport the
     supplied keyring as the new session keyring.

(11) SELinux.

     SELinux has a number of changes, in addition to those to support the LSM
     interface changes mentioned above:

     (a) selinux_setprocattr() no longer does its check for whether the
     	 current ptracer can access processes with the new SID inside the lock
     	 that covers getting the ptracer's SID.  Whilst this lock ensures that
     	 the check is done with the ptracer pinned, the result is only valid
     	 until the lock is released, so there's no point doing it inside the
     	 lock.

(12) is_single_threaded().

     This function has been extracted from selinux_setprocattr() and put into
     a file of its own in the lib/ directory as join_session_keyring() now
     wants to use it too.

     The code in SELinux just checked to see whether a task shared mm_structs
     with other tasks (CLONE_VM), but that isn't good enough.  We really want
     to know if they're part of the same thread group (CLONE_THREAD).

(13) nfsd.

     The NFS server daemon now has to use the COW credentials to set the
     credentials it is going to use.  It really needs to pass the credentials
     down to the functions it calls, but it can't do that until other patches
     in this series have been applied.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NJames Morris <jmorris@namei.org>

d84f4f99

26 10月, 2008 1 次提交

Revert "Call init_workqueues before pre smp initcalls." · 4403b406

由 Linus Torvalds 提交于 10月 25, 2008

This reverts commit a802dd0e by moving
the call to init_workqueues() back where it belongs - after SMP has been
initialized.

It also moves stop_machine_init() - which needs workqueues - to a later
phase using a core_initcall() instead of early_initcall().  That should
satisfy all ordering requirements, and was apparently the reason why
init_workqueues() was moved to be too early.

Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

4403b406

23 10月, 2008 2 次提交

memcg: fix page_cgroup allocation · 94b6da5a

由 KAMEZAWA Hiroyuki 提交于 10月 22, 2008

page_cgroup_init() is called from mem_cgroup_init(). But at this
point, we cannot call alloc_bootmem().
(and this caused panic at boot.)

This patch moves page_cgroup_init() to init/main.c.

Time table is following:
==
  parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
  ....
  cgroup_init_early()  # "early" init of cgroup.
  ....
  setup_arch()         # memmap is allocated.
  ...
  page_cgroup_init();
  mem_init();   # we cannot call alloc_bootmem after this.
  ....
  cgroup_init() # mem_cgroup is initialized.
==

Before page_cgroup_init(), mem_map must be initialized. So,
I added page_cgroup_init() to init/main.c directly.

(*) maybe this is not very clean but
    - cgroup_init_early() is too early
    - in cgroup_init(), we have to use vmalloc instead of alloc_bootmem().
    use of vmalloc area in x86-32 is important and we should avoid very large
    vmalloc() in x86-32. So, we want to use alloc_bootmem() and added page_cgroup_init()
    directly to init/main.c

[akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
[akpm@linux-foundation.org: fix build]
Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

94b6da5a

A
[PATCH 1/2] anondev: init IDR statically · 6de24f0e
由 Alexey Dobriyan 提交于 8月 28, 2008
```
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
```
6de24f0e

22 10月, 2008 2 次提交

Call init_workqueues before pre smp initcalls. · a802dd0e

由 Heiko Carstens 提交于 10月 13, 2008

This allows to create workqueues from within the context of
a pre smp initcall (aka early_initcall).
Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

a802dd0e

Make initcall_debug a core_param · d0ea3d7d

由 Rusty Russell 提交于 10月 22, 2008

This is the one I really wanted: now it effects module loading, it
makes sense to be able to flip it after boot.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Acked-by: NArjan van de Ven <arjan@linux.intel.com>

d0ea3d7d

20 10月, 2008 1 次提交

mm: rewrite vmap layer · db64fe02

由 Nick Piggin 提交于 10月 18, 2008

Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).

The biggest problem with vmap is actually vunmap.  Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache.  This is all done under a global lock.  As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
 This gives terrible quadratic scalability characteristics.

Another problem is that the entire vmap subsystem works under a single
lock.  It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.

This is a rewrite of vmap subsystem to solve those problems.  The existing
vmalloc API is implemented on top of the rewritten subsystem.

The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated.  So the addresses aren't allocated again until
a subsequent TLB flush.  A single TLB flush then can flush multiple
vunmaps from each CPU.

XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.

The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.

There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.

To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap.  Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages.  Different numbers
of tests were run in parallel on an 4 core, 2 socket opteron.  Results are
in nanoseconds per map+touch+unmap.

threads           vanilla         vmap rewrite
1                 14700           2900
2                 33600           3000
4                 49500           2800
8                 70631           2900

So with a 8 cores, the rewritten version is already 25x faster.

In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram...  along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system.  I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now.  vmap is pretty well blown off the
profiles.

Before:
1352059 total                                      0.1401
798784 _write_lock                              8320.6667 <- vmlist_lock
529313 default_idle                             1181.5022
 15242 smp_call_function                         15.8771  <- vmap tlb flushing
  2472 __get_vm_area_node                         1.9312  <- vmap
  1762 remove_vm_area                             4.5885  <- vunmap
   316 map_vm_area                                0.2297  <- vmap
   312 kfree                                      0.1950
   300 _spin_lock                                 3.1250
   252 sn_send_IPI_phys                           0.4375  <- tlb flushing
   238 vmap                                       0.8264  <- vmap
   216 find_lock_page                             0.5192
   196 find_next_bit                              0.3603
   136 sn2_send_IPI                               0.2024
   130 pio_phys_write_mmr                         2.0312
   118 unmap_kernel_range                         0.1229

After:
 78406 total                                      0.0081
 40053 default_idle                              89.4040
 33576 ia64_spinlock_contention                 349.7500
  1650 _spin_lock                                17.1875
   319 __reg_op                                   0.5538
   281 _atomic_dec_and_lock                       1.0977
   153 mutex_unlock                               1.5938
   123 iget_locked                                0.1671
   117 xfs_dir_lookup                             0.1662
   117 dput                                       0.1406
   114 xfs_iget_core                              0.0268
    92 xfs_da_hashname                            0.1917
    75 d_alloc                                    0.0670
    68 vmap_page_range                            0.0462 <- vmap
    58 kmem_cache_alloc                           0.0604
    57 memset                                     0.0540
    52 rb_next                                    0.1625
    50 __copy_user                                0.0208
    49 bitmap_find_free_region                    0.2188 <- vmap
    46 ia64_sn_udelay                             0.1106
    45 find_inode_fast                            0.1406
    42 memcmp                                     0.2188
    42 finish_task_switch                         0.1094
    42 __d_lookup                                 0.0410
    40 radix_tree_lookup_slot                     0.1250
    37 _spin_unlock_irqrestore                    0.3854
    36 xfs_bmapi                                  0.0050
    36 kmem_cache_free                            0.0256
    35 xfs_vn_getattr                             0.0322
    34 radix_tree_lookup                          0.1062
    33 __link_path_walk                           0.0035
    31 xfs_da_do_buf                              0.0091
    30 _xfs_buf_find                              0.0204
    28 find_get_page                              0.0875
    27 xfs_iread                                  0.0241
    27 __strncpy_from_user                        0.2812
    26 _xfs_buf_initialize                        0.0406
    24 _xfs_buf_lookup_pages                      0.0179
    24 vunmap_page_range                          0.0250 <- vunmap
    23 find_lock_page                             0.0799
    22 vm_map_ram                                 0.0087 <- vmap
    20 kfree                                      0.0125
    19 put_page                                   0.0330
    18 __kmalloc                                  0.0176
    17 xfs_da_node_lookup_int                     0.0086
    17 _read_lock                                 0.0885
    17 page_waitqueue                             0.0664

vmap has gone from being the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.

[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: NNick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

db64fe02

16 10月, 2008 6 次提交

genirq: revert dynarray · d6c88a50

由 Thomas Gleixner 提交于 10月 15, 2008

Revert the dynarray changes. They need more thought and polishing.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

d6c88a50

Y
dyn_array: split dyn_array functions from init/main.c · e7f5ed8d
由 Yinghai Lu 提交于 8月 19, 2008
```
Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
```
e7f5ed8d

generic: add irq_desc in function in parameter · 46926b67

由 Yinghai Lu 提交于 8月 19, 2008

So we could remove some duplicated calling to irq_desc

v2: make sure irq_desc in  init/main.c is not used without generic_hardirqs
Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

46926b67

x86: alloc dyn_array all together · 1f8ff037

由 Yinghai Lu 提交于 8月 19, 2008

so could spare some memory with small alignment in bootmem

also tighten the alignment checking, and make print out less debug info.
Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1f8ff037

add per_cpu_dyn_array support · 1f3fcd4b

由 Yinghai Lu 提交于 8月 19, 2008

allow dyn-array in per_cpu area, allocated dynamically.

usage:

|  /* in .h */
| struct kernel_stat {
|        struct cpu_usage_stat   cpustat;
|        unsigned int *irqs;
| };
|
|  /* in .c */
| DEFINE_PER_CPU(struct kernel_stat, kstat);
|
| DEFINE_PER_CPU_DYN_ARRAY_ADDR(per_cpu__kstat_irqs, per_cpu__kstat.irqs, sizeof(unsigned int), nr_irqs, sizeof(unsigned long), NULL);

after setup_percpu()/per_cpu_alloc_dyn_array(), the dyn_array in
per_cpu area is ready to use.
Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1f3fcd4b

generic: add dyn_array support · 3ddfda11

由 Yinghai Lu 提交于 8月 19, 2008

Allow crazy big arrays via bootmem at init stage.
Architectures use CONFIG_HAVE_DYN_ARRAY to enable it.

usage:

| static struct irq_desc irq_desc_init __initdata = {
|        .status = IRQ_DISABLED,
|        .chip = &no_irq_chip,
|        .handle_irq = handle_bad_irq,
|        .depth = 1,
|        .lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock),
| #ifdef CONFIG_SMP
|        .affinity = CPU_MASK_ALL
| #endif
| };
|
| static void __init init_work(void *data)
| {
|        struct dyn_array *da = data;
|        struct  irq_desc *desc;
|        int i;
|
|        desc = *da->name;
|
|        for (i = 0; i < *da->nr; i++)
|                memcpy(&desc[i], &irq_desc_init, sizeof(struct irq_desc));
| }
|
| struct irq_desc *irq_desc;
| DEFINE_DYN_ARRAY(irq_desc, sizeof(struct irq_desc), nr_irqs, PAGE_SIZE, init_work);

after pre_alloc_dyn_array() after setup_arch(), the array is ready to be
used.

Via this facility we can replace irq_desc[NR_IRQS] array with dyn_array
irq_desc[nr_irqs].

v2: remove _nopanic in pre_alloc_dyn_array()
Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3ddfda11

14 10月, 2008 7 次提交

tracing/fastboot: add better resolution to initcall debug/tracing · ca538f6b

由 Tim Bird 提交于 10月 09, 2008

Change the time resolution for initcall_debug to microseconds, from
milliseconds.  This is handy to determine which initcalls you want to work
on for faster booting.

One one of my test machines, over 90% of the initcalls are less than a
millisecond and (without this patch) these are all reported as 0 msecs.
Working on the 900 us ones is more important than the 4 us ones.

With 'quiet' on the kernel command line, this adds no significant overhead
to kernel boot time.
Signed-off-by: NTim Bird <tim.bird@am.sony.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

ca538f6b

tracing/fastboot: only trace non-module initcalls · 097d036a

由 Frederic Weisbecker 提交于 10月 03, 2008

At this time, only built-in initcalls interest us.
We can't really produce a relevant graph if we include
the modules initcall too.

I had good results after this patch (see svg in attachment).
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

097d036a

tracing/fastboot: get the initcall name before it disappears · 5601020f

由 Frederic Weisbecker 提交于 10月 02, 2008

After some initcall traces, some initcall names may be inconsistent.
That's because these functions will disappear from the .init section
and also their name from the symbols table.

So we have to copy the name of the function in a buffer large enough
during the trace appending. It is not costly for the ring_buffer because
the number of initcall entries is commonly not really large.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

5601020f

tracing/fastboot: change the printing of boot tracer according to bootgraph.pl · cb5ab742

由 Frederic Weisbecker 提交于 10月 02, 2008

Change the boot tracer printing to make it parsable for
the scripts/bootgraph.pl script.

We have now to output two lines for each initcall, according to the
printk in do_one_initcall() in init/main.c
We need now the call's time and the return's time.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

cb5ab742

tracing/ftrace: launch boot tracing after pre-smp initcalls · 3bf77af6

由 Frédéric Weisbecker 提交于 9月 23, 2008

Launch the boot tracing inside the initcall_debug area. Old printk
have not been removed to keep the old way of initcall tracing for
backward compatibility.

[ mingo@elte.hu: resolved conflicts ]
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3bf77af6

tracing/fastboot: add a script to visualize the kernel boot process / time · aa5d9151

由 Arjan van de Ven 提交于 9月 13, 2008

When optimizing the kernel boot time, it's very valuable to visualize
what is going on at which time. In addition, with the fastboot asynchronous
initcall level, it's very valuable to see which initcall gets run where
and when.

This patch adds a script to turn a dmesg into a SVG graph (that can be
shown with tools such as InkScape, Gimp or Firefox) and a small change
to the initcall code to print the PID of the thread calling the initcall
(so that the script can work out the parallelism).
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>

aa5d9151

ftrace: mcount call site on boot nops core · 68bf21aa

由 Steven Rostedt 提交于 8月 14, 2008

This is the infrastructure to the converting the mcount call sites
recorded by the __mcount_loc section into nops on boot. It also allows
for using these sites to enable tracing as normal. When the __mcount_loc
section is used, the "ftraced" kernel thread is disabled.

This uses the current infrastructure to record the mcount call sites
as well as convert them to nops. The mcount function is kept as a stub
on boot up and not converted to the ftrace_record_ip function. We use the
ftrace_record_ip to only record from the table.

This patch does not handle modules. That comes with a later patch.
Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

68bf21aa

12 10月, 2008 1 次提交

Add a script to visualize the kernel boot process / time · f9b9796a

由 Arjan van de Ven 提交于 9月 13, 2008

When optimizing the kernel boot time, it's very valuable to visualize
what is going on at which time. In addition, with some of the initializing
going asynchronous soon, it's valuable to track/print which worker thread
is executing the initialization.

This patch adds a script to turn a dmesg into a SVG graph (that can be
shown with tools such as InkScape, Gimp or Firefox) and a small change
to the initcall code to print the PID of the thread calling the initcall
(so that the script can work out the parallelism).
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>

f9b9796a

04 10月, 2008 1 次提交

Fix init/main.c to use regular printk with '%pF' for initcall fn · 96d746c6

由 Linus Torvalds 提交于 10月 03, 2008

.. small detail, but the silly e1000e initcall warning debugging caused
me to look at this code.  Rather than gouge my eyes out with a spoon, I
just fixed it.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

96d746c6

12 8月, 2008 1 次提交

modules: extend initcall_debug functionality to the module loader · 59f9415f

由 Arjan van de Ven 提交于 7月 30, 2008

The kernel has this really nice facility where if you put "initcall_debug"
on the kernel commandline, it'll print which function it's going to
execute just before calling an initcall, and then after the call completes
it will

1) print if it had an error code

2) checks for a few simple bugs (like leaving irqs off)
and

3) print how long the init call took in milliseconds.

While trying to optimize the boot speed of my laptop, I have been loving
number 3 to figure out what to optimize...  ...  and then I wished that
the same thing was done for module loading.

This patch makes the module loader use this exact same functionality; it's
a logical extension in my view (since modules are just sort of late
binding initcalls anyway) and so far I've found it quite useful in finding
where things are too slow in my boot.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

59f9415f

06 8月, 2008 1 次提交

remove unnecessary <linux/hdreg.h> includes · b5b9309d

由 Bartlomiej Zolnierkiewicz 提交于 8月 05, 2008

Following files don't need <linux/hdreg.h> at all:

- arch/mips/jazz/setup.c
- arch/sh/boards/mach-systemh/irq.c
- drivers/macintosh/mediabay.c
- drivers/scsi/hptiop.c
- drivers/usb/storage/freecom.c
- arch/powerpc/include/asm/ide.h
- init/main.c

Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>

b5b9309d

31 7月, 2008 1 次提交

initrd: cast `initrd_start' to `void *' · bd673c7c

由 Geert Uytterhoeven 提交于 7月 29, 2008

commit fb6624eb (initrd: Fix virtual/physical
mix-up in overwrite test) introduced the compiler warning below on mips,
as its virt_to_page() doesn't cast the passed address to unsigned long
internally, unlike on most other architectures:

init/main.c: In function `start_kernel':
init/main.c:633: warning: passing argument 1 of `virt_to_phys' makes pointer from integer without a cast
init/main.c:636: warning: passing argument 1 of `virt_to_phys' makes pointer from integer without a cast

For now, kill the warning by explicitly casting initrd_start to `void *', as
that's the type it should really be.
Reported-by: NAtsushi Nemoto <anemo@mba.ocn.ne.jp>
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bd673c7c

27 7月, 2008 2 次提交

Full conversion to early_initcall() interface, remove old interface · 7babe8db

由 Eduard - Gabriel Munteanu 提交于 7月 25, 2008

A previous patch added the early_initcall(), to allow a cleaner hooking of
pre-SMP initcalls.  Now we remove the older interface, converting all
existing users to the new one.

[akpm@linux-foundation.org: cleanups]
[akpm@linux-foundation.org: build fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
[kosaki.motohiro@jp.fujitsu.com: warning fix]
Signed-off-by: NEduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7babe8db

Better interface for hooking early initcalls · c2147a50

由 Eduard - Gabriel Munteanu 提交于 7月 25, 2008

Added early initcall (pre-SMP) support, using an identical interface to
that of regular initcalls. Functions called from do_pre_smp_initcalls()
could be converted to use this cleaner interface.

This is required by CPU hotplug, because early users have to register
notifiers before going SMP. One such CPU hotplug user is the relay
interface with buffer-only channels, which needs to register such a
notifier, to be usable in early code. This in turn is used by kmemtrace.
Signed-off-by: NEduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c2147a50

26 7月, 2008 1 次提交

proper pid{hash,map}_init() prototypes · 3ae4eed3

由 Adrian Bunk 提交于 7月 25, 2008

This patch adds proper prototypes for pid{hash,map}_init() in
include/linux/pid_namespace.h
Signed-off-by: NAdrian Bunk <bunk@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3ae4eed3

21 7月, 2008 1 次提交

initrd: Fix virtual/physical mix-up in overwrite test · fb6624eb

由 Geert Uytterhoeven 提交于 7月 17, 2008

On recent kernels, I get the following error when using an initrd:

| initrd overwritten (0x00b78000 < 0x07668000) - disabling it.

My Amiga 4000 has 12 MiB of RAM at physical address 0x07400000 (virtual
0x00000000).
The initrd is located at the end of RAM: 0x00b78000 - 0x00c00000 (virtual).
The overwrite test compares the (virtual) initrd location to the (physical)
first available memory location, which fails.

This patch converts initrd_start to a page frame number, so it can safely be
compared with min_low_pfn.

Before the introduction of discontiguous memory support on m68k
(12d810c1), min_low_pfn was just left
untouched by the m68k-specific code (zero, I guess), and everything worked
fine.
Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fb6624eb

18 7月, 2008 1 次提交

cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2) · e761b772

由 Max Krasnyansky 提交于 7月 15, 2008

This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.

It allows us to simplify domain management code and avoid unecessary
domain rebuilds during cpu hotplug event handling.

Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. Although I did simplfy and unify domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using exact same code paths as in
cpusets case and hence the test below cover cpusets too.
Cpuset changes to make rebuild_sched_domains() callable from various
contexts are in the separate patch (right next after this one).

This not only boots but also easily handles
	while true; do make clean; make -j 8; done
and
	while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing).

Suprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
this on right now in gnome-terminal and things are moving just fine.

Also this is running with most of the debug features enabled (lockdep,
mutex, etc) no BUG_ONs or lockdep complaints so far.

I believe I addressed all of the Dmitry's comments for original Linus'
version. I changed both fair and rt balancer to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).
Signed-off-by: NMax Krasnyanskiy <maxk@qualcomm.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NGregory Haskins <ghaskins@novell.com>
Cc: dmitry.adamushko@gmail.com
Cc: pj@sgi.com
Signed-off-by: NIngo Molnar <mingo@elte.hu>

e761b772

26 6月, 2008 1 次提交

Add generic helpers for arch IPI function calls · 3d442233

由 Jens Axboe 提交于 6月 26, 2008

This adds kernel/smp.c which contains helpers for IPI function calls. In
addition to supporting the existing smp_call_function() in a more efficient
manner, it also adds a more scalable variant called smp_call_function_single()
for calling a given function on a single CPU only.

The core of this is based on the x86-64 patch from Nick Piggin, lots of
changes since then. "Alan D. Brunelle" <Alan.Brunelle@hp.com> has
contributed lots of fixes and suggestions as well. Also thanks to
Paul E. McKenney <paulmck@linux.vnet.ibm.com> for reviewing RCU usage
and getting rid of the data allocation fallback deadlock.
Acked-by: NIngo Molnar <mingo@elte.hu>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>

3d442233

19 5月, 2008 1 次提交

rcu: add call_rcu_sched() · 4446a36f

由 Paul E. McKenney 提交于 5月 12, 2008

Fourth cut of patch to provide the call_rcu_sched().  This is again to
synchronize_sched() as call_rcu() is to synchronize_rcu().

Should be fine for experimental and -rt use, but not ready for inclusion.
With some luck, I will be able to tell Andrew to come out of hiding on
the next round.

Passes multi-day rcutorture sessions with concurrent CPU hotplugging.

Fixes since the first version include a bug that could result in
indefinite blocking (spotted by Gautham Shenoy), better resiliency
against CPU-hotplug operations, and other minor fixes.

Fixes since the second version include reworking grace-period detection
to avoid deadlocks that could happen when running concurrently with
CPU hotplug, adding Mathieu's fix to avoid the softlockup messages,
as well as Mathieu's fix to allow use earlier in boot.

Fixes since the third version include a wrong-CPU bug spotted by
Andrew, getting rid of the obsolete synchronize_kernel API that somehow
snuck back in, merging spin_unlock() and local_irq_restore() in a
few places, commenting the code that checks for quiescent states based
on interrupting from user-mode execution or the idle loop, removing
some inline attributes, and some code-style changes.

Known/suspected shortcomings:

o	I still do not entirely trust the sleep/wakeup logic.  Next step
	will be to use a private snapshot of the CPU online mask in
	rcu_sched_grace_period() -- if the CPU wasn't there at the start
	of the grace period, we don't need to hear from it.  And the
	bit about accounting for changes in online CPUs inside of
	rcu_sched_grace_period() is ugly anyway.

o	It might be good for rcu_sched_grace_period() to invoke
	resched_cpu() when a given CPU wasn't responding quickly,
	but resched_cpu() is declared static...

This patch also fixes a long-standing bug in the earlier preemptable-RCU
implementation of synchronize_rcu() that could result in loss of
concurrent external changes to a task's CPU affinity mask.  I still cannot
remember who reported this...
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

4446a36f

16 5月, 2008 3 次提交

initcalls: Fix m68k build and possible buffer overflow · a76bfd0d

由 Cyrill Gorcunov 提交于 5月 15, 2008

This patch fixes a build bug on m68k - gcc decides to emit a call to the
strlen library function, which we don't implement.

More importantly - my previous patch "init: don't lose initcall return
values" (commit e662e1cf) had introduced
potential buffer overflow by wrong calculation of string accumulator
size.

Use strlcat() instead, fixing both bugs.

Many thanks Andreas Schwab and Geert Uytterhoeven for helping
to catch and fix the bug.
Signed-off-by: NCyrill Gorcunov <gorcunov@gmail.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a76bfd0d

Split up 'do_initcalls()' into two simpler functions · e0df154f

由 Linus Torvalds 提交于 5月 15, 2008

One function to just loop over the entries, one function to actually do
the call and the associated debugging code.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e0df154f

Clean up 'print_fn_descriptor_symbol()' types · a442ac51

由 Linus Torvalds 提交于 5月 15, 2008

Everybody wants to pass it a function pointer, and in fact, that is what
you _must_ pass it for it to make sense (since it knows that ia64 and
ppc64 use descriptors for function pointers and fetches the actual
address from there).

So don't make the argument be a 'unsigned long' and force everybody to
add a cast.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a442ac51

13 5月, 2008 1 次提交

init: don't lose initcall return values · e662e1cf

由 Cyrill Gorcunov 提交于 5月 12, 2008

There is an ability to lose an initcall return value if it happened with irq
disabled or imbalanced preemption (and if we debug initcall).
Signed-off-by: NCyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e662e1cf

06 5月, 2008 1 次提交

sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK · 3e51f33f

由 Peter Zijlstra 提交于 5月 03, 2008

this replaces the rq->clock stuff (and possibly cpu_clock()).

 - architectures that have an 'imperfect' hardware clock can set
   CONFIG_HAVE_UNSTABLE_SCHED_CLOCK

 - the 'jiffie' window might be superfulous when we update tick_gtod
   before the __update_sched_clock() call in sched_clock_tick()

 - cpu_clock() might be implemented as:

     sched_clock_cpu(smp_processor_id())

   if the accuracy proves good enough - how far can TSC drift in a
   single jiffie when considering the filtering and idle hooks?

[ mingo@elte.hu: various fixes and cleanups ]
Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

3e51f33f

30 4月, 2008 3 次提交

infrastructure to debug (dynamic) objects · 3ac7fe5a

由 Thomas Gleixner 提交于 4月 30, 2008

We can see an ever repeating problem pattern with objects of any kind in the
kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where
we have no chance to decode the root cause anymore.  One problem spot are
kernel timers, where the detection of the problem often happens in interrupt
context and usually causes the machine to panic.

While working on a timer related bug report I had to hack specialized code
into the timer subsystem to get a reasonable hint for the root cause.  This
debug hack was fine for temporary use, but far from a mergeable solution due
to the intrusiveness into the timer code.

The code further lacked the ability to detect and report the root cause
instantly and keep the system operational.

Keeping the system operational is important to get hold of the debug
information without special debugging aids like serial consoles and special
knowledge of the bug reporter.

The problems described above are not restricted to timers, but timers tend to
expose it usually in a full system crash.  Other objects are less explosive,
but the symptoms caused by such mistakes can be even harder to debug.

Instead of creating specialized debugging code for the timer subsystem a
generic infrastructure is created which allows developers to verify their code
and provides an easy to enable debug facility for users in case of trouble.

The debugobjects core code keeps track of operations on static and dynamic
objects by inserting them into a hashed list and sanity checking them on
object operations and provides additional checks whenever kernel memory is
freed.

The tracked object operations are:
- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed and the
subsystem specific code can provide a fixup function which allows to prevent
the damage of the operation.  When the sanity check triggers a warning message
and a stack trace is printed.

The list of operations can be extended if the need arises.  For now it's
limited to the requirements of the first user (timers).

The core code enqueues the objects into hash buckets.  The hash index is
generated from the address of the object to simplify the lookup for the check
on kfree/vfree.  Each bucket has it's own spinlock to avoid contention on a
global lock.

The debug code can be compiled in without being active.  The runtime overhead
is minimal and could be optimized by asm alternatives.  A kernel command line
option enables the debugging code.

Thanks to Ingo Molnar for review, suggestions and cleanup patches.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3ac7fe5a

Deprecate find_task_by_pid() · 5cd20455

由 Pavel Emelyanov 提交于 4月 30, 2008

There are some places that are known to operate on tasks'
global pids only:

* the rest_init() call (called on boot)
* the kgdb's getthread
* the create_kthread() (since the kthread is run in init ns)

So use the find_task_by_pid_ns(..., &init_pid_ns) there
and schedule the find_task_by_pid for removal.

[sukadev@us.ibm.com: Fix warning in kernel/pid.c]
Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NSukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5cd20455

signals: fix /sbin/init protection from unwanted signals · fae5fa44

由 Oleg Nesterov 提交于 4月 30, 2008

The global init has a lot of long standing problems with the unhandled fatal
signals.

	- The "is_global_init(current)" check in get_signal_to_deliver()
	  protects only the main thread. Sub-thread can dequee the fatal
	  signal and shutdown the whole thread group except the main thread.
	  If it dequeues SIGSTOP /sbin/init will be stopped, this is not
	  right too. Note that we can't use is_global_init(->group_leader),
	  this breaks exec and this can't solve other problems we have.

	- Even if afterwards ignored, the fatal signals sets SIGNAL_GROUP_EXIT
	  on delivery. This breaks exec, has other bad implications, and this
	  is just wrong.

Introduce the new SIGNAL_UNKILLABLE flag to fix these problems.  It also helps
to solve some other problems addressed by the subsequent patches.

Currently we use this flag for the global init only, but it could also be used
by kthreads and (perhaps) by the sub-namespace inits.
Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fae5fa44