1. 04 9月, 2009 1 次提交
  2. 21 8月, 2009 1 次提交
    • I
      tracing: Fix too large stack usage in do_one_initcall() · 4a683bf9
      Ingo Molnar 提交于
      One of my testboxes triggered this nasty stack overflow crash
      during SCSI probing:
      
      [    5.874004] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
      [    5.875004] device: 'sda': device_add
      [    5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
      [    5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
      [    5.878004] *pde = 00000000
      [    5.878004] Thread overran stack, or stack corrupted
      [    5.878004] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [    5.878004] last sysfs file:
      [    5.878004]
      [    5.878004] Pid: 1, comm: swapper Not tainted (2.6.31-rc6-tip-01272-g9919e28-dirty #5685)
      [    5.878004] EIP: 0060:[<b1008321>] EFLAGS: 00010083 CPU: 0
      [    5.878004] EIP is at print_context_stack+0x81/0x110
      [    5.878004] EAX: cf8a3000 EBX: cf8a3fe4 ECX: 00000049 EDX: 00000000
      [    5.878004] ESI: b1cfce84 EDI: 00000000 EBP: cf8a3018 ESP: cf8a2ff4
      [    5.878004]  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      [    5.878004] Process swapper (pid: 1, ti=cf8a2000 task=cf8a8000 task.ti=cf8a3000)
      [    5.878004] Stack:
      [    5.878004]  b1004867 fffff000 cf8a3ffc
      [    5.878004] Call Trace:
      [    5.878004]  [<b1004867>] ? kernel_thread_helper+0x7/0x10
      [    5.878004] BUG: unable to handle kernel NULL pointer dereference at 00000a0c
      [    5.878004] IP: [<b1008321>] print_context_stack+0x81/0x110
      [    5.878004] *pde = 00000000
      [    5.878004] Thread overran stack, or stack corrupted
      [    5.878004] Oops: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC
      
      The oops did not reveal any more details about the real stack
      that we have and the system got into an infinite loop of
      recursive pagefaults.
      
      So i booted with CONFIG_STACK_TRACER=y and the 'stacktrace' boot
      parameter. The box did not crash (timings/conditions probably
      changed a tiny bit to trigger the catastrophic crash), but the
      /debug/tracing/stack_trace file was rather revealing:
      
              Depth    Size   Location    (72 entries)
              -----    ----   --------
        0)     3704      52   __change_page_attr+0xb8/0x290
        1)     3652      24   __change_page_attr_set_clr+0x43/0x90
        2)     3628      60   kernel_map_pages+0x108/0x120
        3)     3568      40   prep_new_page+0x7d/0x130
        4)     3528      84   get_page_from_freelist+0x106/0x420
        5)     3444     116   __alloc_pages_nodemask+0xd7/0x550
        6)     3328      36   allocate_slab+0xb1/0x100
        7)     3292      36   new_slab+0x1c/0x160
        8)     3256      36   __slab_alloc+0x133/0x2b0
        9)     3220       4   kmem_cache_alloc+0x1bb/0x1d0
       10)     3216     108   create_object+0x28/0x250
       11)     3108      40   kmemleak_alloc+0x81/0xc0
       12)     3068      24   kmem_cache_alloc+0x162/0x1d0
       13)     3044      52   scsi_pool_alloc_command+0x29/0x70
       14)     2992      20   scsi_host_alloc_command+0x22/0x70
       15)     2972      24   __scsi_get_command+0x1b/0x90
       16)     2948      28   scsi_get_command+0x35/0x90
       17)     2920      24   scsi_setup_blk_pc_cmnd+0xd4/0x100
       18)     2896     128   sd_prep_fn+0x332/0xa70
       19)     2768      36   blk_peek_request+0xe7/0x1d0
       20)     2732      56   scsi_request_fn+0x54/0x520
       21)     2676      12   __generic_unplug_device+0x2b/0x40
       22)     2664      24   blk_execute_rq_nowait+0x59/0x80
       23)     2640     172   blk_execute_rq+0x6b/0xb0
       24)     2468      32   scsi_execute+0xe0/0x140
       25)     2436      64   scsi_execute_req+0x152/0x160
       26)     2372      60   scsi_vpd_inquiry+0x6c/0x90
       27)     2312      44   scsi_get_vpd_page+0x112/0x160
       28)     2268      52   sd_revalidate_disk+0x1df/0x320
       29)     2216      92   rescan_partitions+0x98/0x330
       30)     2124      52   __blkdev_get+0x309/0x350
       31)     2072       8   blkdev_get+0xf/0x20
       32)     2064      44   register_disk+0xff/0x120
       33)     2020      36   add_disk+0x6e/0xb0
       34)     1984      44   sd_probe_async+0xfb/0x1d0
       35)     1940      44   __async_schedule+0xf4/0x1b0
       36)     1896       8   async_schedule+0x12/0x20
       37)     1888      60   sd_probe+0x305/0x360
       38)     1828      44   really_probe+0x63/0x170
       39)     1784      36   driver_probe_device+0x5d/0x60
       40)     1748      16   __device_attach+0x49/0x50
       41)     1732      32   bus_for_each_drv+0x5b/0x80
       42)     1700      24   device_attach+0x6b/0x70
       43)     1676      16   bus_attach_device+0x47/0x60
       44)     1660      76   device_add+0x33d/0x400
       45)     1584      52   scsi_sysfs_add_sdev+0x6a/0x2c0
       46)     1532     108   scsi_add_lun+0x44b/0x460
       47)     1424     116   scsi_probe_and_add_lun+0x182/0x4e0
       48)     1308      36   __scsi_add_device+0xd9/0xe0
       49)     1272      44   ata_scsi_scan_host+0x10b/0x190
       50)     1228      24   async_port_probe+0x96/0xd0
       51)     1204      44   __async_schedule+0xf4/0x1b0
       52)     1160       8   async_schedule+0x12/0x20
       53)     1152      48   ata_host_register+0x171/0x1d0
       54)     1104      60   ata_pci_sff_activate_host+0xf3/0x230
       55)     1044      44   ata_pci_sff_init_one+0xea/0x100
       56)     1000      48   amd_init_one+0xb2/0x190
       57)      952       8   local_pci_probe+0x13/0x20
       58)      944      32   pci_device_probe+0x68/0x90
       59)      912      44   really_probe+0x63/0x170
       60)      868      36   driver_probe_device+0x5d/0x60
       61)      832      20   __driver_attach+0x89/0xa0
       62)      812      32   bus_for_each_dev+0x5b/0x80
       63)      780      12   driver_attach+0x1e/0x20
       64)      768      72   bus_add_driver+0x14b/0x2d0
       65)      696      36   driver_register+0x6e/0x150
       66)      660      20   __pci_register_driver+0x53/0xc0
       67)      640       8   amd_init+0x14/0x16
       68)      632     572   do_one_initcall+0x2b/0x1d0
       69)       60      12   do_basic_setup+0x56/0x6a
       70)       48      20   kernel_init+0x84/0xce
       71)       28      28   kernel_thread_helper+0x7/0x10
      
      There's a lot of fat functions on that stack trace, but
      the largest of all is do_one_initcall(). This is due to
      the boot trace entry variables being on the stack.
      
      Fixing this is relatively easy, initcalls are fundamentally
      serialized, so we can move the local variables to file scope.
      
      Note that this large stack footprint was present for a
      couple of months already - what pushed my system over
      the edge was the addition of kmemleak to the call-chain:
      
        6)     3328      36   allocate_slab+0xb1/0x100
        7)     3292      36   new_slab+0x1c/0x160
        8)     3256      36   __slab_alloc+0x133/0x2b0
        9)     3220       4   kmem_cache_alloc+0x1bb/0x1d0
       10)     3216     108   create_object+0x28/0x250
       11)     3108      40   kmemleak_alloc+0x81/0xc0
       12)     3068      24   kmem_cache_alloc+0x162/0x1d0
       13)     3044      52   scsi_pool_alloc_command+0x29/0x70
      
      This pushes the total to ~3800 bytes, only a tiny bit
      more was needed to corrupt the on-kernel-stack thread_info.
      
      The fix reduces the stack footprint from 572 bytes
      to 28 bytes.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4a683bf9
  3. 14 8月, 2009 1 次提交
  4. 23 6月, 2009 1 次提交
  5. 19 6月, 2009 2 次提交
  6. 17 6月, 2009 2 次提交
    • B
      mm: Move pgtable_cache_init() earlier · c868d550
      Benjamin Herrenschmidt 提交于
      Some architectures need to initialize SLAB caches to be able
      to allocate page tables. They do that from pgtable_cache_init()
      so the later should be called earlier now, best is before
      vmalloc_init().
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c868d550
    • M
      cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie 提交于
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the code of
      memory policy.  Because the memory policy is applied in the process's
      context originally.  After applying this patch, one task directly
      manipulates anothers mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we didn't use lock to protect them, because adding a
      lock may lead to performance regression.  But if we don't add a lock,the
      task might see no nodes when changing cpuset's mems_allowed to some
      non-overlapping set.  In order to avoid it, we set all new allowed nodes,
      then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mpolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58568d2a
  7. 13 6月, 2009 2 次提交
  8. 12 6月, 2009 6 次提交
    • P
      slab,slub: don't enable interrupts during early boot · 7e85ee0c
      Pekka Enberg 提交于
      As explained by Benjamin Herrenschmidt:
      
        Oh and btw, your patch alone doesn't fix powerpc, because it's missing
        a whole bunch of GFP_KERNEL's in the arch code... You would have to
        grep the entire kernel for things that check slab_is_available() and
        even then you'll be missing some.
      
        For example, slab_is_available() didn't always exist, and so in the
        early days on powerpc, we used a mem_init_done global that is set form
        mem_init() (not perfect but works in practice). And we still have code
        using that to do the test.
      
      Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
      in early boot code to avoid enabling interrupts.
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      7e85ee0c
    • K
      memcg: fix page_cgroup fatal error in FLATMEM · ca371c0d
      KAMEZAWA Hiroyuki 提交于
      Now, SLAB is configured in very early stage and it can be used in
      init routine now.
      
      But replacing alloc_bootmem() in FLAT/DISCONTIGMEM's page_cgroup()
      initialization breaks the allocation, now.
      (Works well in SPARSEMEM case...it supports MEMORY_HOTPLUG and
       size of page_cgroup is in reasonable size (< 1 << MAX_ORDER.)
      
      This patch revive FLATMEM+memory cgroup by using alloc_bootmem.
      
      In future,
      We stop to support FLATMEM (if no users) or rewrite codes for flatmem
      completely.But this will adds more messy codes and overheads.
      Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
      Tested-by: NLi Zefan <lizf@cn.fujitsu.com>
      Tested-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      ca371c0d
    • P
      init: introduce mm_init() · 444f478f
      Pekka Enberg 提交于
      As suggested by Christoph Lameter, introduce mm_init() now that we initialize
      all the kernel memory allocations together.
      
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      444f478f
    • P
      vmalloc: use kzalloc() instead of alloc_bootmem() · 43ebdac4
      Pekka Enberg 提交于
      We can call vmalloc_init() after kmem_cache_init() and use kzalloc() instead of
      the bootmem allocator when initializing vmalloc data structures.
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      43ebdac4
    • P
      slab: setup allocators earlier in the boot sequence · 83b519e8
      Pekka Enberg 提交于
      This patch makes kmalloc() available earlier in the boot sequence so we can get
      rid of some bootmem allocations. The bulk of the changes are due to
      kmem_cache_init() being called with interrupts disabled which requires some
      changes to allocator boostrap code.
      
      Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps
      before we call mem_init() during boot as reported by Ingo Molnar:
      
        We have a hard crash in the WP-protect code:
      
        [    0.000000] Checking if this processor honours the WP bit even in supervisor mode...BUG: Int 14: CR2 ffcff000
        [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
        [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
        [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
        [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011fd f8616000 c18237cc
        [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f1322 c166e039 c1822bf0
        [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106a 00020800 01ba5003
        [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539-dirty #52203
        [    0.000000] Call Trace:
        [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
        [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
        [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
        [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
        [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
        [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
        [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      83b519e8
    • C
      kmemleak: Add the base support · 3c7b4e6b
      Catalin Marinas 提交于
      This patch adds the base support for the kernel memory leak
      detector. It traces the memory allocation/freeing in a way similar to
      the Boehm's conservative garbage collector, the difference being that
      the unreferenced objects are not freed but only shown in
      /sys/kernel/debug/kmemleak. Enabling this feature introduces an
      overhead to memory allocations.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      3c7b4e6b
  9. 25 5月, 2009 1 次提交
  10. 17 4月, 2009 1 次提交
    • M
      Driver Core: early platform driver · 13977091
      Magnus Damm 提交于
      V3 of the early platform driver implementation.
      
      Platform drivers are great for embedded platforms because we can separate
      driver configuration from the actual driver.  So base addresses,
      interrupts and other configuration can be kept with the processor or board
      code, and the platform driver can be reused by many different platforms.
      
      For early devices we have nothing today.  For instance, to configure early
      timers and early serial ports we cannot use platform devices.  This
      because the setup order during boot.  Timers are needed before the
      platform driver core code is available.  The same goes for early printk
      support.  Early in this case means before initcalls.
      
      These early drivers today have their configuration either hard coded or
      they receive it using some special configuration method.  This is working
      quite well, but if we want to support both regular kernel modules and
      early devices then we need to have two ways of configuring the same
      driver.  A single way would be better.
      
      The early platform driver patch is basically a set of functions that allow
      drivers to register themselves and architecture code to locate them and
      probe.  Registration happens through early_param().  The time for the
      probe is decided by the architecture code.
      
      See Documentation/driver-model/platform.txt for more details.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMagnus Damm <damm@igel.co.jp>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: David Brownell <david-b@pacbell.net>
      Cc: Tejun Heo <htejun@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      13977091
  11. 12 4月, 2009 1 次提交
  12. 01 4月, 2009 1 次提交
  13. 30 3月, 2009 2 次提交
  14. 26 3月, 2009 1 次提交
    • L
      init,cpuset: fix initialize order · 759ee091
      Lai Jiangshan 提交于
      Impact: cpuset_wq should be initialized after init_workqueues()
      
      When I read /debugfs/tracing/trace_stat/workqueues,
      I got this:
      
       # CPU  INSERTED  EXECUTED   NAME
       # |      |         |          |
      
         0      0          0       cpuset
         0    285        285       events/0
         0      2          2       work_on_cpu/0
         0   1115       1115       khelper
         0    325        325       kblockd/0
         0      0          0       kacpid
         0      0          0       kacpi_notify
         0      0          0       ata/0
         0      0          0       ata_aux
         0      0          0       ksuspend_usbd
         0      0          0       aio/0
         0      0          0       nfsiod
         0      0          0       kpsmoused
         0      0          0       kstriped
         0      0          0       kondemand/0
         0      1          1       hid_compat
         0      0          0       rpciod/0
      
         1     64         64       events/1
         1      2          2       work_on_cpu/1
         1      5          5       kblockd/1
         1      0          0       ata/1
         1      0          0       aio/1
         1      0          0       kondemand/1
         1      0          0       rpciod/1
      
      I found "cpuset" is at the earliest.
      
      I found a create_singlethread_workqueue() is earlier than
      init_workqueues():
      
      kernel_init()
      ->cpuset_init_smp()
        ->create_singlethread_workqueue()
      ->do_basic_setup()
        ->init_workqueues()
      
      I think it's better that create_singlethread_workqueue() is called
      after workqueue subsystem has been initialized.
      Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: NSteven Rostedt <srostedt@redhat.com>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Menage <menage@google.com>
      Cc: miaoxie <miaox@cn.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <49C9F416.1050707@cn.fujitsu.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      759ee091
  15. 26 2月, 2009 1 次提交
    • P
      rcu: Teach RCU that idle task is not quiscent state at boot · a6826048
      Paul E. McKenney 提交于
      This patch fixes a bug located by Vegard Nossum with the aid of
      kmemcheck, updated based on review comments from Nick Piggin,
      Ingo Molnar, and Andrew Morton.  And cleans up the variable-name
      and function-name language.  ;-)
      
      The boot CPU runs in the context of its idle thread during boot-up.
      During this time, idle_cpu(0) will always return nonzero, which will
      fool Classic and Hierarchical RCU into deciding that a large chunk of
      the boot-up sequence is a big long quiescent state.  This in turn causes
      RCU to prematurely end grace periods during this time.
      
      This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
      function to ignore the idle task as a quiescent state until the
      system has started up the scheduler in rest_init(), introducing a
      new non-API function rcu_idle_now_means_idle() to inform RCU of this
      transition.  RCU maintains an internal rcu_idle_cpu_truthful variable
      to track this state, which is then used by rcu_check_callback() to
      determine if it should believe idle_cpu().
      
      Because this patch has the effect of disallowing RCU grace periods
      during long stretches of the boot-up sequence, this patch also introduces
      Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
      no-op if num_online_cpus() returns 1.  This allows boot-time code that
      calls synchronize_rcu() to proceed normally.  Note, however, that RCU
      callbacks registered by call_rcu() will likely queue up until later in
      the boot sequence.  Although rcuclassic and rcutree can also use this
      same optimization after boot completes, rcupreempt must restrict its
      use of this optimization to the portion of the boot sequence before the
      scheduler starts up, given that an rcupreempt RCU read-side critical
      section may be preeempted.
      
      In addition, this patch takes Nick Piggin's suggestion to make the
      system_state global variable be __read_mostly.
      
      Changes since v4:
      
      o	Changes the name of the introduced function and variable to
      	be less emotional.  ;-)
      
      Changes since v3:
      
      o	WARN_ON(nr_context_switches() > 0) to verify that RCU
      	switches out of boot-time mode before the first context
      	switch, as suggested by Nick Piggin.
      
      Changes since v2:
      
      o	Created rcu_blocking_is_gp() internal-to-RCU API that
      	determines whether a call to synchronize_rcu() is itself
      	a grace period.
      
      o	The definition of rcu_blocking_is_gp() for rcuclassic and
      	rcutree checks to see if but a single CPU is online.
      
      o	The definition of rcu_blocking_is_gp() for rcupreempt
      	checks to see both if but a single CPU is online and if
      	the system is still in early boot.
      
      	This allows rcupreempt to again work correctly if running
      	on a single CPU after booting is complete.
      
      o	Added check to rcupreempt's synchronize_sched() for there
      	being but one online CPU.
      
      Tested all three variants both SMP and !SMP, booted fine, passed a short
      rcutorture test on both x86 and Power.
      Located-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a6826048
  16. 06 2月, 2009 1 次提交
  17. 08 1月, 2009 1 次提交
    • A
      async: Asynchronous function calls to speed up kernel boot · 22a9d645
      Arjan van de Ven 提交于
      Right now, most of the kernel boot is strictly synchronous, such that
      various hardware delays are done sequentially.
      
      In order to make the kernel boot faster, this patch introduces
      infrastructure to allow doing some of the initialization steps
      asynchronously, which will hide significant portions of the hardware delays
      in practice.
      
      In order to not change device order and other similar observables, this
      patch does NOT do full parallel initialization.
      
      Rather, it operates more in the way an out of order CPU does; the work may
      be done out of order and asynchronous, but the observable effects
      (instruction retiring for the CPU) are still done in the original sequence.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      22a9d645
  18. 07 1月, 2009 3 次提交
  19. 06 1月, 2009 1 次提交
  20. 03 1月, 2009 1 次提交
  21. 01 1月, 2009 2 次提交
  22. 30 12月, 2008 1 次提交
    • F
      tracing/kmemtrace: normalize the raw tracer event to the unified tracing API · 36994e58
      Frederic Weisbecker 提交于
      Impact: new tracer plugin
      
      This patch adapts kmemtrace raw events tracing to the unified tracing API.
      
      To enable and use this tracer, just do the following:
      
       echo kmemtrace > /debugfs/tracing/current_tracer
       cat /debugfs/tracing/trace
      
      You will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
      type_id 1 call_site 18446744071565527833 ptr 18446612134395152256
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164672 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345164912 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 0 call_site 18446744071565636711 ptr 18446612134345165152 bytes_req 240 bytes_alloc 240 gfp_flags 208 node -1
      type_id 0 call_site 18446744071566144042 ptr 18446612134346191680 bytes_req 1304 bytes_alloc 1312 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      type_id 0 call_site 18446744071565585597 ptr 18446612134405955584 bytes_req 4096 bytes_alloc 4096 gfp_flags 208 node -1
      type_id 1 call_site 18446744071565585534 ptr 18446612134405955584
      
      That was to stay backward compatible with the format output produced in
      inux/tracepoint.h.
      
      This is the default ouput, but note that I tried something else.
      
      If you change an option:
      
      echo kmem_minimalistic > /debugfs/trace_options
      
      and then cat /debugfs/trace, you will have the following output:
      
       # tracer: kmemtrace
       #
       #
       # ALLOC  TYPE  REQ   GIVEN  FLAGS           POINTER         NODE    CALLER
       # FREE   |      |     |       |              |   |            |        |
       # |
      
         -      C                            0xffff88007c088780          file_free_rcu
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc780     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc870     -1   d_alloc
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         +      K    240    240   000000d0   0xffff8800790dc960     -1   d_alloc
         +      K   1304   1312   000000d0   0xffff8800791d7340     -1   reiserfs_alloc_inode
         -      C                            0xffff88007cad6000          putname
         +      K   4096   4096   000000d0   0xffff88007cad6000     -1   getname
         -      C                            0xffff88007cad6000          putname
         +      K    992   1000   000000d0   0xffff880079045b58     -1   alloc_inode
         +      K    768   1024   000080d0   0xffff88007c096400     -1   alloc_pipe_info
         +      K    240    240   000000d0   0xffff8800790dca50     -1   d_alloc
         +      K    272    320   000080d0   0xffff88007c088780     -1   get_empty_filp
         +      K    272    320   000080d0   0xffff88007c088000     -1   get_empty_filp
      
      Yeah I shall confess kmem_minimalistic should be: kmem_alternative.
      
      Whatever, I find it more readable but this a personal opinion of course.
      We can drop it if you want.
      
      On the ALLOC/FREE column, + means an allocation and - a free.
      
      On the type column, you have K = kmalloc, C = cache, P = page
      
      I would like the flags to be GFP_* strings but that would not be easy to not
      break the column with strings....
      
      About the node...it seems to always be -1. I don't know why but that shouldn't
      be difficult to find.
      
      I moved linux/tracepoint.h to trace/tracepoint.h as well. I think that would
      be more easy to find the tracer headers if they are all in their common
      directory.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      36994e58
  23. 29 12月, 2008 2 次提交
  24. 27 12月, 2008 1 次提交
  25. 08 12月, 2008 1 次提交
    • Y
      sparse irq_desc[] array: core kernel and x86 changes · 0b8f1efa
      Yinghai Lu 提交于
      Impact: new feature
      
      Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
      NR_CPUS set to large values. The goal is to be able to scale up to much
      larger NR_IRQS value without impacting the (important) common case.
      
      To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
      irq_desc pointers.
      
      When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
      this also makes the IRQ descriptors NUMA-local (to the site that calls
      request_irq()).
      
      This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
      uses desc->chip_data for x86 to store irq_cfg.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0b8f1efa
  26. 23 11月, 2008 1 次提交
  27. 14 11月, 2008 1 次提交
    • D
      CRED: Inaugurate COW credentials · d84f4f99
      David Howells 提交于
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
      	 NOTE!  This results in the loss of some state: SELinux's osid no
      	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d84f4f99