提交 · 7db905e636f08ea5bc9825c1f73d77802e8ccad5 · openeuler / Kernel

04 9月, 2009 1 次提交

rcu: Move end of special early-boot RCU operation earlier · 7db905e6

由 Paul E. McKenney 提交于 9月 02, 2009

Ingo was getting warnings from rcu_scheduler_starting()
indicating that context switches had occurred before RCU ended
its special early-boot handling of grace periods.

This is a dangerous condition, as it indicates that RCU might
have prematurely ended grace periods.  This exploratory fix
moves rcu_scheduler_starting() earlier in boot.
Reported-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

7db905e6

21 8月, 2009 1 次提交

tracing: Fix too large stack usage in do_one_initcall() · 4a683bf9

由 Ingo Molnar 提交于 8月 21, 2009

One of my testboxes triggered this nasty stack overflow crash
during SCSI probing:

[    5.874004] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    5.875004] device: 'sda': device_add
[    5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
[    5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
[    5.878004] *pde = 00000000
[    5.878004] Thread overran stack, or stack corrupted
[    5.878004] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[    5.878004] last sysfs file:
[    5.878004]
[    5.878004] Pid: 1, comm: swapper Not tainted (2.6.31-rc6-tip-01272-g9919e28-dirty #5685)
[    5.878004] EIP: 0060:[<b1008321>] EFLAGS: 00010083 CPU: 0
[    5.878004] EIP is at print_context_stack+0x81/0x110
[    5.878004] EAX: cf8a3000 EBX: cf8a3fe4 ECX: 00000049 EDX: 00000000
[    5.878004] ESI: b1cfce84 EDI: 00000000 EBP: cf8a3018 ESP: cf8a2ff4
[    5.878004]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
[    5.878004] Process swapper (pid: 1, ti=cf8a2000 task=cf8a8000 task.ti=cf8a3000)
[    5.878004] Stack:
[    5.878004]  b1004867 fffff000 cf8a3ffc
[    5.878004] Call Trace:
[    5.878004]  [<b1004867>] ? kernel_thread_helper+0x7/0x10
[    5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
[    5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
[    5.878004] *pde = 00000000
[    5.878004] Thread overran stack, or stack corrupted
[    5.878004] Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC

The oops did not reveal any more details about the real stack
that we have and the system got into an infinite loop of
recursive pagefaults.

So i booted with CONFIG_STACK_TRACER=y and the 'stacktrace' boot
parameter. The box did not crash (timings/conditions probably
changed a tiny bit to trigger the catastrophic crash), but the
/debug/tracing/stack_trace file was rather revealing:

        Depth    Size   Location    (72 entries)
        -----    ----   --------
  0)     3704      52   __change_page_attr+0xb8/0x290
  1)     3652      24   __change_page_attr_set_clr+0x43/0x90
  2)     3628      60   kernel_map_pages+0x108/0x120
  3)     3568      40   prep_new_page+0x7d/0x130
  4)     3528      84   get_page_from_freelist+0x106/0x420
  5)     3444     116   __alloc_pages_nodemask+0xd7/0x550
  6)     3328      36   allocate_slab+0xb1/0x100
  7)     3292      36   new_slab+0x1c/0x160
  8)     3256      36   __slab_alloc+0x133/0x2b0
  9)     3220       4   kmem_cache_alloc+0x1bb/0x1d0
 10)     3216     108   create_object+0x28/0x250
 11)     3108      40   kmemleak_alloc+0x81/0xc0
 12)     3068      24   kmem_cache_alloc+0x162/0x1d0
 13)     3044      52   scsi_pool_alloc_command+0x29/0x70
 14)     2992      20   scsi_host_alloc_command+0x22/0x70
 15)     2972      24   __scsi_get_command+0x1b/0x90
 16)     2948      28   scsi_get_command+0x35/0x90
 17)     2920      24   scsi_setup_blk_pc_cmnd+0xd4/0x100
 18)     2896     128   sd_prep_fn+0x332/0xa70
 19)     2768      36   blk_peek_request+0xe7/0x1d0
 20)     2732      56   scsi_request_fn+0x54/0x520
 21)     2676      12   __generic_unplug_device+0x2b/0x40
 22)     2664      24   blk_execute_rq_nowait+0x59/0x80
 23)     2640     172   blk_execute_rq+0x6b/0xb0
 24)     2468      32   scsi_execute+0xe0/0x140
 25)     2436      64   scsi_execute_req+0x152/0x160
 26)     2372      60   scsi_vpd_inquiry+0x6c/0x90
 27)     2312      44   scsi_get_vpd_page+0x112/0x160
 28)     2268      52   sd_revalidate_disk+0x1df/0x320
 29)     2216      92   rescan_partitions+0x98/0x330
 30)     2124      52   __blkdev_get+0x309/0x350
 31)     2072       8   blkdev_get+0xf/0x20
 32)     2064      44   register_disk+0xff/0x120
 33)     2020      36   add_disk+0x6e/0xb0
 34)     1984      44   sd_probe_async+0xfb/0x1d0
 35)     1940      44   __async_schedule+0xf4/0x1b0
 36)     1896       8   async_schedule+0x12/0x20
 37)     1888      60   sd_probe+0x305/0x360
 38)     1828      44   really_probe+0x63/0x170
 39)     1784      36   driver_probe_device+0x5d/0x60
 40)     1748      16   __device_attach+0x49/0x50
 41)     1732      32   bus_for_each_drv+0x5b/0x80
 42)     1700      24   device_attach+0x6b/0x70
 43)     1676      16   bus_attach_device+0x47/0x60
 44)     1660      76   device_add+0x33d/0x400
 45)     1584      52   scsi_sysfs_add_sdev+0x6a/0x2c0
 46)     1532     108   scsi_add_lun+0x44b/0x460
 47)     1424     116   scsi_probe_and_add_lun+0x182/0x4e0
 48)     1308      36   __scsi_add_device+0xd9/0xe0
 49)     1272      44   ata_scsi_scan_host+0x10b/0x190
 50)     1228      24   async_port_probe+0x96/0xd0
 51)     1204      44   __async_schedule+0xf4/0x1b0
 52)     1160       8   async_schedule+0x12/0x20
 53)     1152      48   ata_host_register+0x171/0x1d0
 54)     1104      60   ata_pci_sff_activate_host+0xf3/0x230
 55)     1044      44   ata_pci_sff_init_one+0xea/0x100
 56)     1000      48   amd_init_one+0xb2/0x190
 57)      952       8   local_pci_probe+0x13/0x20
 58)      944      32   pci_device_probe+0x68/0x90
 59)      912      44   really_probe+0x63/0x170
 60)      868      36   driver_probe_device+0x5d/0x60
 61)      832      20   __driver_attach+0x89/0xa0
 62)      812      32   bus_for_each_dev+0x5b/0x80
 63)      780      12   driver_attach+0x1e/0x20
 64)      768      72   bus_add_driver+0x14b/0x2d0
 65)      696      36   driver_register+0x6e/0x150
 66)      660      20   __pci_register_driver+0x53/0xc0
 67)      640       8   amd_init+0x14/0x16
 68)      632     572   do_one_initcall+0x2b/0x1d0
 69)       60      12   do_basic_setup+0x56/0x6a
 70)       48      20   kernel_init+0x84/0xce
 71)       28      28   kernel_thread_helper+0x7/0x10

There's a lot of fat functions on that stack trace, but
the largest of all is do_one_initcall(). This is due to
the boot trace entry variables being on the stack.

Fixing this is relatively easy, initcalls are fundamentally
serialized, so we can move the local variables to file scope.

Note that this large stack footprint was present for a
couple of months already - what pushed my system over
the edge was the addition of kmemleak to the call-chain:

  6)     3328      36   allocate_slab+0xb1/0x100
  7)     3292      36   new_slab+0x1c/0x160
  8)     3256      36   __slab_alloc+0x133/0x2b0
  9)     3220       4   kmem_cache_alloc+0x1bb/0x1d0
 10)     3216     108   create_object+0x28/0x250
 11)     3108      40   kmemleak_alloc+0x81/0xc0
 12)     3068      24   kmem_cache_alloc+0x162/0x1d0
 13)     3044      52   scsi_pool_alloc_command+0x29/0x70

This pushes the total to ~3800 bytes, only a tiny bit
more was needed to corrupt the on-kernel-stack thread_info.

The fix reduces the stack footprint from 572 bytes
to 28 bytes.

Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <srostedt@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

4a683bf9

14 8月, 2009 1 次提交

init: set nr_cpu_ids before setup_per_cpu_areas() · d6647bdf

由 Tejun Heo 提交于 7月 21, 2009

nr_cpu_ids is dependent only on cpu_possible_map and
setup_per_cpu_areas() already depends on cpu_possible_map and will use
nr_cpu_ids.  Initialize nr_cpu_ids before setting up percpu areas.
Signed-off-by: NTejun Heo <tj@kernel.org>

d6647bdf

23 6月, 2009 1 次提交

mm/init: cpu_hotplug_init() must be initialized before SLAB · 31950eb6

由 Linus Torvalds 提交于 6月 22, 2009

SLAB uses get/put_online_cpus() which use a mutex which is itself only
initialized when cpu_hotplug_init() is called.  Currently we hang suring
boot in SLAB due to doing that too late.

Reported by James Bottomley and Sachin Sant (and possibly others).
Debugged by Benjamin Herrenschmidt.

This just removes the dynamic initialization of the data structures, and
replaces it with a static one, avoiding this dependency entirely, and
removing one unnecessary special initcall.
Tested-by: NSachin Sant <sachinp@in.ibm.com>
Tested-by: NJames Bottomley <James.Bottomley@HansenPartnership.com>
Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

31950eb6

19 6月, 2009 2 次提交

mm: Extend gfp masking to the page allocator · dcce284a

由 Benjamin Herrenschmidt 提交于 6月 18, 2009

The page allocator also needs the masking of gfp flags during boot,
so this moves it out of slab/slub and uses it with the page allocator
as well.
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dcce284a

kernel: constructor support · b99b87f7

由 Peter Oberparleiter 提交于 6月 17, 2009

Call constructors (gcc-generated initcall-like functions) during kernel
start and module load.  Constructors are e.g.  used for gcov data
initialization.

Disable constructor support for usermode Linux to prevent conflicts with
host glibc.
Signed-off-by: NPeter Oberparleiter <oberpar@linux.vnet.ibm.com>
Acked-by: NRusty Russell <rusty@rustcorp.com.au>
Acked-by: NWANG Cong <xiyou.wangcong@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Li Wei <W.Li@Sun.COM>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Heiko Carstens <heicars2@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <mschwid2@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b99b87f7

17 6月, 2009 2 次提交

mm: Move pgtable_cache_init() earlier · c868d550

由 Benjamin Herrenschmidt 提交于 6月 17, 2009

Some architectures need to initialize SLAB caches to be able
to allocate page tables. They do that from pgtable_cache_init()
so the later should be called earlier now, best is before
vmalloc_init().
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c868d550

cpuset,mm: update tasks' mems_allowed in time · 58568d2a

由 Miao Xie 提交于 6月 16, 2009

Fix allocating page cache/slab object on the unallowed node when memory
spread is set by updating tasks' mems_allowed after its cpuset's mems is
changed.

In order to update tasks' mems_allowed in time, we must modify the code of
memory policy.  Because the memory policy is applied in the process's
context originally.  After applying this patch, one task directly
manipulates anothers mems_allowed, and we use alloc_lock in the
task_struct to protect mems_allowed and memory policy of the task.

But in the fast path, we didn't use lock to protect them, because adding a
lock may lead to performance regression.  But if we don't add a lock,the
task might see no nodes when changing cpuset's mems_allowed to some
non-overlapping set.  In order to avoid it, we set all new allowed nodes,
then clear newly disallowed ones.

[lee.schermerhorn@hp.com:
  The rework of mpol_new() to extract the adjusting of the node mask to
  apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
  with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
  allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
  node mask to mpol_new_mpolicy().

  Remove the now unneeded 'nodes = NULL' from mpol_new().

  Note that mpol_new_mempolicy() is always called with a non-NULL
  'nodes' parameter now that it has been removed from mpol_new().
  Therefore, we don't need to test nodes for NULL before testing it for
  'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
  verify this assumption.]
[lee.schermerhorn@hp.com:

  I don't think the function name 'mpol_new_mempolicy' is descriptive
  enough to differentiate it from mpol_new().

  This function applies cpuset set context, usually constraining nodes
  to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
  is set, it also translates the nodes.  So I settled on
  'mpol_set_nodemask()', because the comment block for mpol_new() mentions
  that we need to call this function to "set nodes".

  Some additional minor line length, whitespace and typo cleanup.]
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Paul Menage <menage@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

58568d2a

13 6月, 2009 2 次提交

kmemcheck: add the kmemcheck core · dfec072e

由 Vegard Nossum 提交于 4月 04, 2008

General description: kmemcheck is a patch to the linux kernel that
detects use of uninitialized memory. It does this by trapping every
read and write to memory that was allocated dynamically (e.g. using
kmalloc()). If a memory address is read that has not previously been
written to, a message is printed to the kernel log.

Thanks to Andi Kleen for the set_memory_4k() solution.

Andrew Morton suggested documenting the shadow member of struct page.
Signed-off-by: NVegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

[export kmemcheck_mark_initialized]
[build fix for setup_max_cpus]
Signed-off-by: NIngo Molnar <mingo@elte.hu>

[rebased for mainline inclusion]
Signed-off-by: NVegard Nossum <vegardno@ifi.uio.no>

dfec072e

L
ACPI: move declaration acpi_early_init() to acpi.h · 4a7a16dc
由 Len Brown 提交于 6月 12, 2009
```
Signed-off-by: NLen Brown <len.brown@intel.com>
```
4a7a16dc

12 6月, 2009 6 次提交

slab,slub: don't enable interrupts during early boot · 7e85ee0c

由 Pekka Enberg 提交于 6月 12, 2009

As explained by Benjamin Herrenschmidt:

  Oh and btw, your patch alone doesn't fix powerpc, because it's missing
  a whole bunch of GFP_KERNEL's in the arch code... You would have to
  grep the entire kernel for things that check slab_is_available() and
  even then you'll be missing some.

  For example, slab_is_available() didn't always exist, and so in the
  early days on powerpc, we used a mem_init_done global that is set form
  mem_init() (not perfect but works in practice). And we still have code
  using that to do the test.

Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
in early boot code to avoid enabling interrupts.
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

7e85ee0c

memcg: fix page_cgroup fatal error in FLATMEM · ca371c0d

由 KAMEZAWA Hiroyuki 提交于 6月 12, 2009

Now, SLAB is configured in very early stage and it can be used in
init routine now.

But replacing alloc_bootmem() in FLAT/DISCONTIGMEM's page_cgroup()
initialization breaks the allocation, now.
(Works well in SPARSEMEM case...it supports MEMORY_HOTPLUG and
 size of page_cgroup is in reasonable size (< 1 << MAX_ORDER.)

This patch revive FLATMEM+memory cgroup by using alloc_bootmem.

In future,
We stop to support FLATMEM (if no users) or rewrite codes for flatmem
completely.But this will adds more messy codes and overheads.
Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
Tested-by: NLi Zefan <lizf@cn.fujitsu.com>
Tested-by: NIngo Molnar <mingo@elte.hu>
Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

ca371c0d

init: introduce mm_init() · 444f478f

由 Pekka Enberg 提交于 6月 11, 2009

As suggested by Christoph Lameter, introduce mm_init() now that we initialize
all the kernel memory allocations together.

Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

444f478f

vmalloc: use kzalloc() instead of alloc_bootmem() · 43ebdac4

由 Pekka Enberg 提交于 5月 25, 2009

We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
the bootmem allocator when initializing vmalloc data structures.
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Acked-by: NNick Piggin <npiggin@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

43ebdac4

slab: setup allocators earlier in the boot sequence · 83b519e8

由 Pekka Enberg 提交于 6月 10, 2009

This patch makes kmalloc() available earlier in the boot sequence so we can get
rid of some bootmem allocations. The bulk of the changes are due to
kmem_cache_init() being called with interrupts disabled which requires some
changes to allocator boostrap code.

Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps
before we call mem_init() during boot as reported by Ingo Molnar:

  We have a hard crash in the WP-protect code:

  [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000
  [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
  [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
  [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
  [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
  [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
  [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
  [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203
  [    0.000000] Call Trace:
  [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
  [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
  [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
  [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
  [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
  [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
  [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

83b519e8

kmemleak: Add the base support · 3c7b4e6b

由 Catalin Marinas 提交于 6月 11, 2009

This patch adds the base support for the kernel memory leak
detector. It traces the memory allocation/freeing in a way similar to
the Boehm's conservative garbage collector, the difference being that
the unreferenced objects are not freed but only shown in
/sys/kernel/debug/kmemleak. Enabling this feature introduces an
overhead to memory allocations.
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>

3c7b4e6b

25 5月, 2009 1 次提交

Use a format for linux_banner · 657cafa6

由 Alex Riesen 提交于 5月 24, 2009

There is no format specifiers left in the linux_banner, and gcc-4.3
complains seeing the printk.
Signed-off-by: NAlex Riesen <raa.lkml@gmail.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

657cafa6

17 4月, 2009 1 次提交

Driver Core: early platform driver · 13977091

由 Magnus Damm 提交于 3月 30, 2009

V3 of the early platform driver implementation.

Platform drivers are great for embedded platforms because we can separate
driver configuration from the actual driver.  So base addresses,
interrupts and other configuration can be kept with the processor or board
code, and the platform driver can be reused by many different platforms.

For early devices we have nothing today.  For instance, to configure early
timers and early serial ports we cannot use platform devices.  This
because the setup order during boot.  Timers are needed before the
platform driver core code is available.  The same goes for early printk
support.  Early in this case means before initcalls.

These early drivers today have their configuration either hard coded or
they receive it using some special configuration method.  This is working
quite well, but if we want to support both regular kernel modules and
early devices then we need to have two ways of configuring the same
driver.  A single way would be better.

The early platform driver patch is basically a set of functions that allow
drivers to register themselves and architecture code to locate them and
probe.  Registration happens through early_param().  The time for the
probe is decided by the architecture code.

See Documentation/driver-model/platform.txt for more details.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: NMagnus Damm <damm@igel.co.jp>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: David Brownell <david-b@pacbell.net>
Cc: Tejun Heo <htejun@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>

13977091

12 4月, 2009 1 次提交

tracing, kmemtrace: Separate include/trace/kmemtrace.h to kmemtrace part and tracepoint part · 02af61bb

由 Zhaolei 提交于 4月 10, 2009

Impact: refactor code for future changes

Current kmemtrace.h is used both as header file of kmemtrace and kmem's
tracepoints definition.

Tracepoints' definition file may be used by other code, and should only have
definition of tracepoint.

We can separate include/trace/kmemtrace.h into 2 files:

  include/linux/kmemtrace.h: header file for kmemtrace
  include/trace/kmem.h:      definition of kmem tracepoints
Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
Acked-by: NEduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <49DEE68A.5040902@cn.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

02af61bb

01 4月, 2009 1 次提交

init/main.c: fix sparse warnings: context imbalance · acdd052a

由 Hannes Eder 提交于 3月 31, 2009

Impact: Attribute function 'init_post' with __releases(...).

Fix these sparse warnings:
  init/main.c:805:21: warning: context imbalance in 'init_post' - unexpected unlock
  init/main.c:899:9: warning: context imbalance in 'kernel_init' - wrong count at exit
Signed-off-by: NHannes Eder <hannes@hanneseder.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

acdd052a

30 3月, 2009 2 次提交

cpumask: use set_cpu_active in init/main.c · 2b17fa50

由 Rusty Russell 提交于 3月 30, 2009

cpu_active_map is deprecated in favor of cpu_active_mask, which is
const for safety: we use accessors now (set_cpu_active) is we really
want to make a change.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

2b17fa50

cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL · 1a2142af

由 Rusty Russell 提交于 3月 30, 2009

Impact: cleanup

(Thanks to Al Viro for reminding me of this, via Ingo)

CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:

	#define CPU_MASK_ALL (cpumask_t) { { ... } }

Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:

	#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)

Which formalizes this practice.  One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).

So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real const struct cpumask *).
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Acked-by: NIngo Molnar <mingo@elte.hu>
Reported-by: NAl Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>

1a2142af

26 3月, 2009 1 次提交

init,cpuset: fix initialize order · 759ee091

由 Lai Jiangshan 提交于 3月 25, 2009

Impact: cpuset_wq should be initialized after init_workqueues()

When I read /debugfs/tracing/trace_stat/workqueues,
I got this:

 # CPU  INSERTED  EXECUTED   NAME
 # |      |         |          |

   0      0          0       cpuset
   0    285        285       events/0
   0      2          2       work_on_cpu/0
   0   1115       1115       khelper
   0    325        325       kblockd/0
   0      0          0       kacpid
   0      0          0       kacpi_notify
   0      0          0       ata/0
   0      0          0       ata_aux
   0      0          0       ksuspend_usbd
   0      0          0       aio/0
   0      0          0       nfsiod
   0      0          0       kpsmoused
   0      0          0       kstriped
   0      0          0       kondemand/0
   0      1          1       hid_compat
   0      0          0       rpciod/0

   1     64         64       events/1
   1      2          2       work_on_cpu/1
   1      5          5       kblockd/1
   1      0          0       ata/1
   1      0          0       aio/1
   1      0          0       kondemand/1
   1      0          0       rpciod/1

I found "cpuset" is at the earliest.

I found a create_singlethread_workqueue() is earlier than
init_workqueues():

kernel_init()
->cpuset_init_smp()
  ->create_singlethread_workqueue()
->do_basic_setup()
  ->init_workqueues()

I think it's better that create_singlethread_workqueue() is called
after workqueue subsystem has been initialized.
Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: NSteven Rostedt <srostedt@redhat.com>
Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <menage@google.com>
Cc: miaoxie <miaox@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49C9F416.1050707@cn.fujitsu.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

759ee091

26 2月, 2009 1 次提交

rcu: Teach RCU that idle task is not quiscent state at boot · a6826048

由 Paul E. McKenney 提交于 2月 25, 2009

This patch fixes a bug located by Vegard Nossum with the aid of
kmemcheck, updated based on review comments from Nick Piggin,
Ingo Molnar, and Andrew Morton.  And cleans up the variable-name
and function-name language.  ;-)

The boot CPU runs in the context of its idle thread during boot-up.
During this time, idle_cpu(0) will always return nonzero, which will
fool Classic and Hierarchical RCU into deciding that a large chunk of
the boot-up sequence is a big long quiescent state.  This in turn causes
RCU to prematurely end grace periods during this time.

This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
function to ignore the idle task as a quiescent state until the
system has started up the scheduler in rest_init(), introducing a
new non-API function rcu_idle_now_means_idle() to inform RCU of this
transition.  RCU maintains an internal rcu_idle_cpu_truthful variable
to track this state, which is then used by rcu_check_callback() to
determine if it should believe idle_cpu().

Because this patch has the effect of disallowing RCU grace periods
during long stretches of the boot-up sequence, this patch also introduces
Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
no-op if num_online_cpus() returns 1.  This allows boot-time code that
calls synchronize_rcu() to proceed normally.  Note, however, that RCU
callbacks registered by call_rcu() will likely queue up until later in
the boot sequence.  Although rcuclassic and rcutree can also use this
same optimization after boot completes, rcupreempt must restrict its
use of this optimization to the portion of the boot sequence before the
scheduler starts up, given that an rcupreempt RCU read-side critical
section may be preeempted.

In addition, this patch takes Nick Piggin's suggestion to make the
system_state global variable be __read_mostly.

Changes since v4:

o	Changes the name of the introduced function and variable to
	be less emotional.  ;-)

Changes since v3:

o	WARN_ON(nr_context_switches() > 0) to verify that RCU
	switches out of boot-time mode before the first context
	switch, as suggested by Nick Piggin.

Changes since v2:

o	Created rcu_blocking_is_gp() internal-to-RCU API that
	determines whether a call to synchronize_rcu() is itself
	a grace period.

o	The definition of rcu_blocking_is_gp() for rcuclassic and
	rcutree checks to see if but a single CPU is online.

o	The definition of rcu_blocking_is_gp() for rcupreempt
	checks to see both if but a single CPU is online and if
	the system is still in early boot.

	This allows rcupreempt to again work correctly if running
	on a single CPU after booting is complete.

o	Added check to rcupreempt's synchronize_sched() for there
	being but one online CPU.

Tested all three variants both SMP and !SMP, booted fine, passed a short
rcutorture test on both x86 and Power.
Located-by: NVegard Nossum <vegard.nossum@gmail.com>
Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

a6826048

06 2月, 2009 1 次提交

smp, generic: introduce arch_disable_smp_support() instead of disable_ioapic_setup() · 65a4e574

由 Ingo Molnar 提交于 1月 31, 2009

Impact: cleanup

disable_ioapic_setup() in init/main.c is ugly as the function is
x86-specific. The #ifdef inline prototype there is ugly too.

Replace it with a generic arch_disable_smp_support() function - which
has a weak alias for non-x86 architectures and for non-ioapic x86 builds.
Signed-off-by: NIngo Molnar <mingo@elte.hu>

65a4e574

08 1月, 2009 1 次提交

async: Asynchronous function calls to speed up kernel boot · 22a9d645

由 Arjan van de Ven 提交于 1月 07, 2009

Right now, most of the kernel boot is strictly synchronous, such that
various hardware delays are done sequentially.

In order to make the kernel boot faster, this patch introduces
infrastructure to allow doing some of the initialization steps
asynchronously, which will hide significant portions of the hardware delays
in practice.

In order to not change device order and other similar observables, this
patch does NOT do full parallel initialization.

Rather, it operates more in the way an out of order CPU does; the work may
be done out of order and asynchronous, but the observable effects
(instruction retiring for the CPU) are still done in the original sequence.
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>

22a9d645

07 1月, 2009 3 次提交

init/main.c: mark late_time_init as __initdata · d2e3192b

由 Jan Beulich 提交于 1月 06, 2009

Signed-off-by: NJan Beulich <jbeulich@novell.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d2e3192b

Remove remaining unwinder code · f1883f86

由 Alexey Dobriyan 提交于 1月 06, 2009

Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Cc: Gabor Gombas <gombasg@sztaki.hu>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f1883f86

init: properly placing noinline keyword · f99ebf0a

由 Rakib Mullick 提交于 1月 06, 2009

checkpatch warns about 'static void noinline'.  It wants `static noinline
void'.

Both are permissible, but the kernel consistently uses `static inline' and
`static noinline', and consistency is good.  Hence let's keep the
checkpatch warning and fix up this code site.

[akpm@linux-foundation.org: rewrote changelog]
Signed-off-by: NMd.Rakib H. Mullick <rakib.mullick@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f99ebf0a

06 1月, 2009 1 次提交

trivial: add missing printk loglevel in start_kernel · 24d431d0

由 Ron Lee 提交于 11月 27, 2008

Add missing printk loglevel in start_kernel
Signed-off-by: NRon Lee <ron@debian.org>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>

24d431d0

03 1月, 2009 1 次提交

kbuild: Remove gcc 4.1.0 quirk from init/main.c · 609e5b71

由 Ingo Molnar 提交于 1月 02, 2009

Impact: cleanup

We now have a cleaner check for gcc 4.1.0/4.1.1 trouble in
include/linux/compiler-gcc4.h, so remove the 4.1.0 quirk from
init/main.c.
Reported-by: NBartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>
Acked-by: NSam Ravnborg <sam@ravnborg.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

609e5b71

01 1月, 2009 2 次提交

cpumask: Use find_last_bit() · e0c0ba73

由 Rusty Russell 提交于 1月 01, 2009

Impact: cleanup

There's one obvious place to use it: to find the highest possible cpu.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>

e0c0ba73

cpumask: Use accessors code in core · 915441b6

由 Rusty Russell 提交于 1月 01, 2009

Impact: use new API

cpu_*_map are going away in favour of cpu_*_mask, but const pointers.
So we have accessors where we really do want to frob them.  Archs
will also need the (trivial) conversion before we can finally remove
cpu_*_map.
Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
Signed-off-by: NMike Travis <travis@sgi.com>

915441b6

30 12月, 2008 1 次提交

tracing/kmemtrace: normalize the raw tracer event to the unified tracing API · 36994e58

由 Frederic Weisbecker 提交于 12月 29, 2008

Impact: new tracer plugin

This patch adapts kmemtrace raw events tracing to the unified tracing API.

To enable and use this tracer, just do the following:

 echo kmemtrace > /debugfs/tracing/current_tracer
 cat /debugfs/tracing/trace

You will have the following output:

 # tracer: kmemtrace
 #
 #
 # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
 # FREE   |      |     |       |              |   |            |        |
 # |

type_id 1 call_site 18446744071565527833 ptr 18446612134395152256
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
type_id 1 call_site 18446744071565585534 ptr 18446612134405955584

That was to stay backward compatible with the format output produced in
inux/tracepoint.h.

This is the default ouput, but note that I tried something else.

If you change an option:

echo kmem_minimalistic > /debugfs/trace_options

and then cat /debugfs/trace, you will have the following output:

 # tracer: kmemtrace
 #
 #
 # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
 # FREE   |      |     |       |              |   |            |        |
 # |

   -      C                            0xffff88007c088780          file_free_rcu
   +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
   -      C                            0xffff88007cad6000          putname
   +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
   +      K    240    240   000000d0   0xffff8800790dc780     -1   d_alloc
   -      C                            0xffff88007cad6000          putname
   +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
   +      K    240    240   000000d0   0xffff8800790dc870     -1   d_alloc
   -      C                            0xffff88007cad6000          putname
   +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
   +      K    240    240   000000d0   0xffff8800790dc960     -1   d_alloc
   +      K   1304   1312   000000d0   0xffff8800791d7340     -1   reiserfs_alloc_inode
   -      C                            0xffff88007cad6000          putname
   +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
   -      C                            0xffff88007cad6000          putname
   +      K    992   1000   000000d0   0xffff880079045b58     -1   alloc_inode
   +      K    768   1024   000080d0   0xffff88007c096400     -1   alloc_pipe_info
   +      K    240    240   000000d0   0xffff8800790dca50     -1   d_alloc
   +      K    272    320   000080d0   0xffff88007c088780     -1   get_empty_filp
   +      K    272    320   000080d0   0xffff88007c088000     -1   get_empty_filp

Yeah I shall confess kmem_minimalistic should be: kmem_alternative.

Whatever, I find it more readable but this a personal opinion of course.
We can drop it if you want.

On the ALLOC/FREE column, + means an allocation and - a free.

On the type column, you have K = kmalloc, C = cache, P = page

I would like the flags to be GFP_* strings but that would not be easy to not
break the column with strings....

About the node...it seems to always be -1. I don't know why but that shouldn't
be difficult to find.

I moved linux/tracepoint.h to trace/tracepoint.h as well. I think that would
be more easy to find the tracer headers if they are all in their common
directory.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

36994e58

29 12月, 2008 2 次提交

kmemtrace: Core implementation. · b9ce08c0

由 Eduard - Gabriel Munteanu 提交于 8月 10, 2008

kmemtrace provides tracing for slab allocator functions, such as kmalloc,
kfree, kmem_cache_alloc, kmem_cache_free etc.. Collected data is then fed
to the userspace application in order to analyse allocation hotspots,
internal fragmentation and so on, making it possible to see how well an
allocator performs, as well as debug and profile kernel code.
Signed-off-by: NEduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>

b9ce08c0

sparseirq: move __weak symbols into separate compilation unit · 43a25632

由 Yinghai Lu 提交于 12月 28, 2008

GCC has a bug with __weak alias functions: if the functions are in
the same compilation unit as their call site, GCC can decide to
inline them - and thus rob the linker of the opportunity to override
the weak alias with the real thing.

So move all the IRQ handling related __weak symbols to kernel/irq/chip.c.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

43a25632

27 12月, 2008 1 次提交

sparseirq: work around compiler optimizing away __weak functions · 13a0c3c2

由 Yinghai Lu 提交于 12月 26, 2008

Impact: fix panic on null pointer with sparseirq

Some GCC versions seem to inline the weak global function,
when that function is empty.

Work it around, by making the functions return a (dummy) integer.
Signed-off-by: NYinghai <yinghai@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

13a0c3c2

08 12月, 2008 1 次提交

sparse irq_desc[] array: core kernel and x86 changes · 0b8f1efa

由 Yinghai Lu 提交于 12月 05, 2008

Impact: new feature

Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
NR_CPUS set to large values. The goal is to be able to scale up to much
larger NR_IRQS value without impacting the (important) common case.

To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
irq_desc pointers.

When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
this also makes the IRQ descriptors NUMA-local (to the site that calls
request_irq()).

This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
uses desc->chip_data for x86 to store irq_cfg.
Signed-off-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

0b8f1efa

23 11月, 2008 1 次提交

init/main.c: use ktime accessor function in initcall_debug code · 1d926f27

由 Will Newton 提交于 11月 21, 2008

Impact: fix initcall debug output on non-scalar ktime platforms (32-bit embedded)

The initcall_debug code access the tv64 member of ktime.  This won't work
correctly for large deltas on platforms that don't use the scalar ktime
implementation.
Signed-off-by: NWill Newton <will.newton@gmail.com>
Acked-by: NTim Bird <tim.bird@am.sony.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NIngo Molnar <mingo@elte.hu>

1d926f27

14 11月, 2008 1 次提交

CRED: Inaugurate COW credentials · d84f4f99

由 David Howells 提交于 11月 14, 2008

Inaugurate copy-on-write credentials management.  This uses RCU to manage the
credentials pointer in the task_struct with respect to accesses by other tasks.
A process may only modify its own credentials, and so does not need locking to
access or modify its own credentials.

A mutex (cred_replace_mutex) is added to the task_struct to control the effect
of PTRACE_ATTACHED on credential calculations, particularly with respect to
execve().

With this patch, the contents of an active credentials struct may not be
changed directly; rather a new set of credentials must be prepared, modified
and committed using something like the following sequence of events:

	struct cred *new = prepare_creds();
	int ret = blah(new);
	if (ret < 0) {
		abort_creds(new);
		return ret;
	}
	return commit_creds(new);

There are some exceptions to this rule: the keyrings pointed to by the active
credentials may be instantiated - keyrings violate the COW rule as managing
COW keyrings is tricky, given that it is possible for a task to directly alter
the keys in a keyring in use by another task.

To help enforce this, various pointers to sets of credentials, such as those in
the task_struct, are declared const.  The purpose of this is compile-time
discouragement of altering credentials through those pointers.  Once a set of
credentials has been made public through one of these pointers, it may not be
modified, except under special circumstances:

  (1) Its reference count may incremented and decremented.

  (2) The keyrings to which it points may be modified, but not replaced.

The only safe way to modify anything else is to create a replacement and commit
using the functions described in Documentation/credentials.txt (which will be
added by a later patch).

This patch and the preceding patches have been tested with the LTP SELinux
testsuite.

This patch makes several logical sets of alteration:

 (1) execve().

     This now prepares and commits credentials in various places in the
     security code rather than altering the current creds directly.

 (2) Temporary credential overrides.

     do_coredump() and sys_faccessat() now prepare their own credentials and
     temporarily override the ones currently on the acting thread, whilst
     preventing interference from other threads by holding cred_replace_mutex
     on the thread being dumped.

     This will be replaced in a future patch by something that hands down the
     credentials directly to the functions being called, rather than altering
     the task's objective credentials.

 (3) LSM interface.

     A number of functions have been changed, added or removed:

     (*) security_capset_check(), ->capset_check()
     (*) security_capset_set(), ->capset_set()

     	 Removed in favour of security_capset().

     (*) security_capset(), ->capset()

     	 New.  This is passed a pointer to the new creds, a pointer to the old
     	 creds and the proposed capability sets.  It should fill in the new
     	 creds or return an error.  All pointers, barring the pointer to the
     	 new creds, are now const.

     (*) security_bprm_apply_creds(), ->bprm_apply_creds()

     	 Changed; now returns a value, which will cause the process to be
     	 killed if it's an error.

     (*) security_task_alloc(), ->task_alloc_security()

     	 Removed in favour of security_prepare_creds().

     (*) security_cred_free(), ->cred_free()

     	 New.  Free security data attached to cred->security.

     (*) security_prepare_creds(), ->cred_prepare()

     	 New. Duplicate any security data attached to cred->security.

     (*) security_commit_creds(), ->cred_commit()

     	 New. Apply any security effects for the upcoming installation of new
     	 security by commit_creds().

     (*) security_task_post_setuid(), ->task_post_setuid()

     	 Removed in favour of security_task_fix_setuid().

     (*) security_task_fix_setuid(), ->task_fix_setuid()

     	 Fix up the proposed new credentials for setuid().  This is used by
     	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
     	 setuid() changes.  Changes are made to the new credentials, rather
     	 than the task itself as in security_task_post_setuid().

     (*) security_task_reparent_to_init(), ->task_reparent_to_init()

     	 Removed.  Instead the task being reparented to init is referred
     	 directly to init's credentials.

	 NOTE!  This results in the loss of some state: SELinux's osid no
	 longer records the sid of the thread that forked it.

     (*) security_key_alloc(), ->key_alloc()
     (*) security_key_permission(), ->key_permission()

     	 Changed.  These now take cred pointers rather than task pointers to
     	 refer to the security context.

 (4) sys_capset().

     This has been simplified and uses less locking.  The LSM functions it
     calls have been merged.

 (5) reparent_to_kthreadd().

     This gives the current thread the same credentials as init by simply using
     commit_thread() to point that way.

 (6) __sigqueue_alloc() and switch_uid()

     __sigqueue_alloc() can't stop the target task from changing its creds
     beneath it, so this function gets a reference to the currently applicable
     user_struct which it then passes into the sigqueue struct it returns if
     successful.

     switch_uid() is now called from commit_creds(), and possibly should be
     folded into that.  commit_creds() should take care of protecting
     __sigqueue_alloc().

 (7) [sg]et[ug]id() and co and [sg]et_current_groups.

     The set functions now all use prepare_creds(), commit_creds() and
     abort_creds() to build and check a new set of credentials before applying
     it.

     security_task_set[ug]id() is called inside the prepared section.  This
     guarantees that nothing else will affect the creds until we've finished.

     The calling of set_dumpable() has been moved into commit_creds().

     Much of the functionality of set_user() has been moved into
     commit_creds().

     The get functions all simply access the data directly.

 (8) security_task_prctl() and cap_task_prctl().

     security_task_prctl() has been modified to return -ENOSYS if it doesn't
     want to handle a function, or otherwise return the return value directly
     rather than through an argument.

     Additionally, cap_task_prctl() now prepares a new set of credentials, even
     if it doesn't end up using it.

 (9) Keyrings.

     A number of changes have been made to the keyrings code:

     (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
     	 all been dropped and built in to the credentials functions directly.
     	 They may want separating out again later.

     (b) key_alloc() and search_process_keyrings() now take a cred pointer
     	 rather than a task pointer to specify the security context.

     (c) copy_creds() gives a new thread within the same thread group a new
     	 thread keyring if its parent had one, otherwise it discards the thread
     	 keyring.

     (d) The authorisation key now points directly to the credentials to extend
     	 the search into rather pointing to the task that carries them.

     (e) Installing thread, process or session keyrings causes a new set of
     	 credentials to be created, even though it's not strictly necessary for
     	 process or session keyrings (they're shared).

(10) Usermode helper.

     The usermode helper code now carries a cred struct pointer in its
     subprocess_info struct instead of a new session keyring pointer.  This set
     of credentials is derived from init_cred and installed on the new process
     after it has been cloned.

     call_usermodehelper_setup() allocates the new credentials and
     call_usermodehelper_freeinfo() discards them if they haven't been used.  A
     special cred function (prepare_usermodeinfo_creds()) is provided
     specifically for call_usermodehelper_setup() to call.

     call_usermodehelper_setkeys() adjusts the credentials to sport the
     supplied keyring as the new session keyring.

(11) SELinux.

     SELinux has a number of changes, in addition to those to support the LSM
     interface changes mentioned above:

     (a) selinux_setprocattr() no longer does its check for whether the
     	 current ptracer can access processes with the new SID inside the lock
     	 that covers getting the ptracer's SID.  Whilst this lock ensures that
     	 the check is done with the ptracer pinned, the result is only valid
     	 until the lock is released, so there's no point doing it inside the
     	 lock.

(12) is_single_threaded().

     This function has been extracted from selinux_setprocattr() and put into
     a file of its own in the lib/ directory as join_session_keyring() now
     wants to use it too.

     The code in SELinux just checked to see whether a task shared mm_structs
     with other tasks (CLONE_VM), but that isn't good enough.  We really want
     to know if they're part of the same thread group (CLONE_THREAD).

(13) nfsd.

     The NFS server daemon now has to use the COW credentials to set the
     credentials it is going to use.  It really needs to pass the credentials
     down to the functions it calls, but it can't do that until other patches
     in this series have been applied.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NJames Morris <jmorris@namei.org>
Signed-off-by: NJames Morris <jmorris@namei.org>

d84f4f99

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功