1. 25 Dec, 2008 1 commit
    • rcu: provide RCU options on non-preempt architectures too · 12d79baf
      Ingo Molnar committed
      Impact: build fix
      
      Some old architectures still do not use kernel/Kconfig.preempt, so the
      moving of the RCU options there broke their build:
      
       In file included from /home/mingo/tip/include/linux/sem.h:81,
                       from /home/mingo/tip/include/linux/sched.h:69,
                       from /home/mingo/tip/arch/alpha/kernel/asm-offsets.c:9:
       /home/mingo/tip/include/linux/rcupdate.h:62:2: error: #error "Unknown RCU implementation specified to kernel configuration"
      
      Move these options back to init/Kconfig, which every architecture
      includes.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 20 Dec, 2008 1 commit
    • allow stripping of generated symbols under CONFIG_KALLSYMS_ALL · 9bb48247
      Jan Beulich committed
      Building upon parts of the module stripping patch, this patch
      introduces similar stripping for vmlinux when CONFIG_KALLSYMS_ALL=y.
      Using CONFIG_KALLSYMS_STRIP_GENERATED reduces the overhead of
      CONFIG_KALLSYMS_ALL from 245k/310k to 65k/80k for the (i386/x86-64)
      kernels I tested with.
      
      The patch also does away with the need to special case the kallsyms-
      internal symbols by making them available even in the first linking
      stage.
      
      While it is a generated file, the patch includes the changes to
      scripts/genksyms/keywords.c_shipped, as I'm unsure what the procedure
      here is.
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
  3. 19 Dec, 2008 1 commit
    • "Tree RCU": scalable classic RCU implementation · 64db4cff
      Paul E. McKenney committed
      This patch fixes a long-standing performance bug in classic RCU that
      results in massive internal-to-RCU lock contention on systems with
      more than a few hundred CPUs.  Although this patch creates a separate
      flavor of RCU for ease of review and patch maintenance, it is intended
      to replace classic RCU.
      
      This patch still handles stress better than does mainline, so I am still
      calling it ready for inclusion.  This patch is against the -tip tree.
      Nevertheless, experience on an actual 1000+ CPU machine would still be
      most welcome.
      
      Most of the changes noted below were found while creating an rcutiny
      (which should permit ejecting the current rcuclassic) and while doing
      detailed line-by-line documentation.
      
      Updates from v9 (http://lkml.org/lkml/2008/12/2/334):
      
      o	Fixes from remainder of line-by-line code walkthrough,
      	including comment spelling, initialization, undesirable
      	narrowing due to type conversion, removing redundant memory
      	barriers, removing redundant local-variable initialization,
      	and removing redundant local variables.
      
      	I do not believe that any of these fixes address the CPU-hotplug
      	issues that Andi Kleen was seeing, but please do give it a whirl
      	in case the machine is smarter than I am.
      
      	A writeup from the walkthrough may be found at the following
      	URL, in case you are suffering from terminal insomnia or
      	masochism:
      
      	http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf
      
      o	Made rcutree tracing use seq_file, as suggested some time
      	ago by Lai Jiangshan.
      
      o	Added a .csv variant of the rcudata debugfs trace file, to allow
      	people having thousands of CPUs to drop the data into
      	a spreadsheet.	Tested with oocalc and gnumeric.  Updated
      	documentation to suit.
      
      Updates from v8 (http://lkml.org/lkml/2008/11/15/139):
      
      o	Fix a theoretical race between grace-period initialization and
      	force_quiescent_state() that could occur if more than three
      	jiffies were required to carry out the grace-period
      	initialization.  Which it might, if you had enough CPUs.
      
      o	Apply Ingo's printk-standardization patch.
      
      o	Substitute local variables for repeated accesses to global
      	variables.
      
      o	Fix comment misspellings and redundant (but harmless) increments
      	of ->n_rcu_pending (this latter after having explicitly added it).
      
      o	Apply checkpatch fixes.
      
      Updates from v7 (http://lkml.org/lkml/2008/10/10/291):
      
      o	Fixed a number of problems noted by Gautham Shenoy, including
      	the cpu-stall-detection bug that he was having difficulty
      	convincing me was real.  ;-)
      
      o	Changed cpu-stall detection to wait for ten seconds rather than
      	three in order to reduce false positives, as suggested by Ingo
      	Molnar.
      
      o	Produced a design document (http://lwn.net/Articles/305782/).
      	The act of writing this document uncovered a number of both
      	theoretical and "here and now" bugs as noted below.
      
      o	Fix dynticks_nesting accounting confusion, simplify WARN_ON()
      	condition, fix kerneldoc comments, and add memory barriers
      	in dynticks interface functions.
      
      o	Add more data to tracing.
      
      o	Remove unused "rcu_barrier" field from rcu_data structure.
      
      o	Count calls to rcu_pending() from scheduling-clock interrupt
      	to use as a surrogate timebase should jiffies stop counting.
      
      o	Fix a theoretical race between force_quiescent_state() and
      	grace-period initialization.  Yes, initialization does have to
      	go on for some jiffies for this race to occur, but given enough
      	CPUs...
      
      Updates from v6 (http://lkml.org/lkml/2008/9/23/448):
      
      o	Fix a number of checkpatch.pl complaints.
      
      o	Apply review comments from Ingo Molnar and Lai Jiangshan
      	on the stall-detection code.
      
      o	Fix several bugs in !CONFIG_SMP builds.
      
      o	Fix a misspelled config-parameter name so that RCU now announces
      	at boot time if stall detection is configured.
      
      o	Run tests on numerous combinations of configuration parameters,
      	which, after the fixes above, now build and run correctly.
      
      Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):
      
      o	Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
      	changeset some time ago, and finally got around to retesting
      	this option).
      
      o	Fix some tracing bugs in rcupreempt that caused incorrect
      	totals to be printed.
      
      o	I now test with a more brutal random-selection online/offline
      	script (attached).  Probably more brutal than it needs to be
      	on the people reading it as well, but so it goes.
      
      o	A number of optimizations and usability improvements:
      
      	o	Make rcu_pending() ignore the grace-period timeout when
      		there is no grace period in progress.
      
      	o	Make force_quiescent_state() avoid going for a global
      		lock in the case where there is no grace period in
      		progress.
      
      	o	Rearrange struct fields to improve struct layout.
      
      	o	Make call_rcu() initiate a grace period if RCU was
      		idle, rather than waiting for the next scheduling
      		clock interrupt.
      
      	o	Invoke rcu_irq_enter() and rcu_irq_exit() only when
      		idle, as suggested by Andi Kleen.  I still don't
      		completely trust this change, and might back it out.
      
      	o	Make CONFIG_RCU_TRACE be the single config variable
      		manipulated for all forms of RCU, instead of the prior
      		confusion.
      
      	o	Document tracing files and formats for both rcupreempt
      		and rcutree.
      
      Updates from v4 for those missing v5 given its bad subject line:
      
      o	Separated dynticks interface so that NMIs and irqs call separate
      	functions, greatly simplifying it.  In particular, this code
      	no longer requires a proof of correctness.  ;-)
      
      o	Separated dynticks state out into its own per-CPU structure,
      	avoiding the duplicated accounting.
      
      o	The case where a dynticks-idle CPU runs an irq handler that
      	invokes call_rcu() is now correctly handled, forcing that CPU
      	out of dynticks-idle mode.
      
      o	Review comments have been applied (thank you all!!!).
      	For but one example, fixed the dynticks-ordering issue that
      	Manfred pointed out, saving me much debugging.  ;-)
      
      o	Adjusted rcuclassic and rcupreempt to handle dynticks changes.
      
      Attached is an updated patch to Classic RCU that applies a hierarchy,
      greatly reducing the contention on the top-level lock for large machines.
      This passes 10-hour concurrent rcutorture and online-offline testing on
      128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
      bugs in the presence of dynticks (exciting working on a system where
      "sleep 1" hangs until interrupted...), which were fixed in the
      2.6.27 kernel.  It is getting more reliable than mainline by some
      measures, so the next version will be against -tip for inclusion.
      See also Manfred Spraul's recent patches (or his earlier work from
      2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
      We will converge onto a common patch in the fullness of time, but are
      currently exploring different regions of the design space.  That said,
      I have already gratefully stolen quite a few of Manfred's ideas.
      
      This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
      of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
      64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
      there is no hierarchy.  By default, the RCU initialization code will
      adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
      architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
      this balancing, allowing the hierarchy to be exactly aligned to the
      underlying hardware.  Up to two levels of hierarchy are permitted
      (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
      systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
      am going to regret saying this, but this seems more than sufficient
      for the foreseeable future.  (Some architectures might wish to set
      CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
      If this becomes a real problem, additional levels can be added, but I
      doubt that it will make a significant difference on real hardware.)
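
As a rough check of the arithmetic above, the sketch below assumes that the largest supported CPU count is simply the fanout raised to the number of levels (the root plus up to two further levels, so three). The names rcu_max_cpus(), rcu_levels_needed() and RCU_MAX_LEVELS are illustrative only and are not kernel identifiers. Under this assumption a fanout of 64 gives 262,144 CPUs and a fanout of 4 gives 64, matching the figures quoted; a fanout of 32 gives 32,768, so the 16,384 figure quoted for 32-bit systems does not follow from this simple model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the root level plus two further levels. */
#define RCU_MAX_LEVELS 3

/* Largest CPU count a full tree covers, assuming capacity = fanout^levels. */
static uint64_t rcu_max_cpus(unsigned int fanout, unsigned int levels)
{
	uint64_t cpus = 1;
	unsigned int i;

	assert(levels >= 1 && levels <= RCU_MAX_LEVELS);
	for (i = 0; i < levels; i++)
		cpus *= fanout;
	return cpus;
}

/*
 * Smallest number of levels that covers nr_cpus, or 0 if even
 * RCU_MAX_LEVELS levels are not enough for this fanout.
 * One level means "root only", i.e. no hierarchy.
 */
static unsigned int rcu_levels_needed(uint64_t nr_cpus, unsigned int fanout)
{
	unsigned int levels;

	for (levels = 1; levels <= RCU_MAX_LEVELS; levels++)
		if (rcu_max_cpus(fanout, levels) >= nr_cpus)
			return levels;
	return 0;
}
```

For example, rcu_levels_needed(128, 64) yields 2 (a root plus one level of leaves), while rcu_levels_needed(16, 64) yields 1, the no-hierarchy case described above.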
      
      In the common case, a given CPU will manipulate its private rcu_data
      structure and the rcu_node structure that it shares with its immediate
      neighbors.  This can reduce both lock and memory contention by multiple
      orders of magnitude, which should eliminate the need for the strange
      manipulations that are reported to be required when running Linux on
      very large systems.
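
To illustrate where the contention reduction comes from, here is a minimal user-space sketch of a two-level tree in which each CPU clears its bit in a leaf node shared with its neighbors, and only the last CPU in each group touches the root. All names (demo_node, demo_report_qs() and so on) are hypothetical, the layout is simplified (contiguous CPU numbering, no balancing, no real locks), and the counter merely stands in for lock acquisitions; this is not the rcutree code itself.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_FANOUT    4	/* small fanout chosen for illustration */
#define DEMO_NR_CPUS   16
#define DEMO_NR_LEAVES ((DEMO_NR_CPUS + DEMO_FANOUT - 1) / DEMO_FANOUT)

struct demo_node {
	unsigned long qsmask;		/* set bit: child has not reported yet */
	struct demo_node *parent;	/* NULL for the root */
	int lock_acquisitions;		/* stands in for taking this node's lock */
};

static struct demo_node demo_root;
static struct demo_node demo_leaves[DEMO_NR_LEAVES];

/* Start a simulated grace period: every CPU and every leaf is pending. */
static void demo_tree_init(void)
{
	int cpu;
	int i;

	demo_root.parent = NULL;
	demo_root.qsmask = 0;
	demo_root.lock_acquisitions = 0;
	for (i = 0; i < DEMO_NR_LEAVES; i++) {
		demo_leaves[i].parent = &demo_root;
		demo_leaves[i].qsmask = 0;
		demo_leaves[i].lock_acquisitions = 0;
		demo_root.qsmask |= 1UL << i;
	}
	for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		demo_leaves[cpu / DEMO_FANOUT].qsmask |= 1UL << (cpu % DEMO_FANOUT);
}

/* Each CPU shares one leaf with its DEMO_FANOUT - 1 immediate neighbors. */
static struct demo_node *demo_cpu_to_leaf(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return &demo_leaves[cpu / DEMO_FANOUT];
}

/*
 * Report a quiescent state for one CPU.  The report only moves up to the
 * parent when the current node has no pending children left.
 * Returns 1 if this report completed the simulated grace period.
 */
static int demo_report_qs(int cpu)
{
	struct demo_node *np = demo_cpu_to_leaf(cpu);
	unsigned long bit = 1UL << (cpu % DEMO_FANOUT);

	for (;;) {
		np->lock_acquisitions++;
		np->qsmask &= ~bit;
		if (np->qsmask != 0)
			return 0;	/* neighbors still pending: stop here */
		if (np->parent == NULL)
			return 1;	/* root is clear: grace period done */
		bit = 1UL << (np - demo_leaves);
		np = np->parent;
	}
}
```

With 16 CPUs and a fanout of 4, the root is touched 4 times per simulated grace period instead of 16, and each leaf is touched only by its own 4 CPUs.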
      
      Some shortcomings:
      
      o	More bugs will probably surface as a result of an ongoing
      	line-by-line code inspection.
      
      	Patches will be provided as required.
      
      o	There are probably hangs, rcutorture failures, &c.  Seems
      	quite stable on a 128-CPU machine, but that is kind of small
      	compared to 4096 CPUs.  However, seems to do better than
      	mainline.
      
      	Patches will be provided as required.
      
      o	The memory footprint of this version is several KB larger
      	than rcuclassic.
      
      	A separate UP-only rcutiny patch will be provided, which will
      	reduce the memory footprint significantly, even compared
      	to the old rcuclassic.  One such patch passes light testing,
      	and has a memory footprint smaller even than rcuclassic.
      	Initial reaction from various embedded guys was "it is not
      	worth it", so am putting it aside.
      
      Credits:
      
      o	Manfred Spraul for ideas, review comments, and bugs spotted,
      	as well as some good friendly competition.  ;-)
      
      o	Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
      	Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
      	for reviews and comments.
      
      o	Thomas Gleixner for much-needed help with some timer issues
      	(see patches below).
      
      o	Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
      	Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
      	Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
      	alive despite my heavy abuse^Wtesting.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 23 Nov, 2008 1 commit
  5. 16 Nov, 2008 1 commit
  6. 14 Nov, 2008 2 commits
    • CRED: Inaugurate COW credentials · d84f4f99
      David Howells committed
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may be incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
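
The prepare/modify/commit rule can be modelled in user space as follows. This is only a toy sketch under stated assumptions: struct toy_cred, current_toy_cred and the toy_*() functions are hypothetical stand-ins, there is a single task, and the old credentials are freed immediately rather than after an RCU grace period as the kernel would require. It shows that the published pointer is const, that changes are made only to a private copy, and that a failed change leaves the active credentials untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a credentials struct; the fields are illustrative only. */
struct toy_cred {
	int usage;		/* reference count */
	unsigned int uid;
	unsigned int gid;
};

static struct toy_cred init_toy_cred = { 1, 0, 0 };

/* The published pointer is const: readers cannot modify through it. */
static const struct toy_cred *current_toy_cred = &init_toy_cred;

/* Make a private, writable copy of the active credentials. */
static struct toy_cred *toy_prepare_creds(void)
{
	struct toy_cred *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	memcpy(new, current_toy_cred, sizeof(*new));
	new->usage = 1;
	return new;
}

/* Discard a prepared copy that was never published. */
static void toy_abort_creds(struct toy_cred *new)
{
	free(new);
}

/* Publish the new credentials by swapping the pointer. */
static int toy_commit_creds(struct toy_cred *new)
{
	const struct toy_cred *old = current_toy_cred;

	assert(new != NULL);
	current_toy_cred = new;
	/* Simplification: no concurrent readers, so free the old copy now. */
	if (old != &init_toy_cred)
		free((void *)old);
	return 0;
}

/* Example caller: change the uid, rejecting (unsigned int)-1 as invalid. */
static int toy_setuid(unsigned int uid)
{
	struct toy_cred *new = toy_prepare_creds();

	if (!new)
		return -1;
	if (uid == (unsigned int)-1) {
		toy_abort_creds(new);
		return -1;
	}
	new->uid = uid;
	return toy_commit_creds(new);
}
```

After toy_setuid(1000) succeeds, current_toy_cred points at a new object with uid 1000 while init_toy_cred is unchanged; a rejected call leaves the pointer where it was.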
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alterations:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
           	 NOTE!  This results in the loss of some state: SELinux's osid no
           	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather than pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: James Morris <jmorris@namei.org>
      Signed-off-by: James Morris <jmorris@namei.org>
    • Kconfig: SLUB is the default slab allocator · 02f56210
      Simon Arlott committed
      In 2007, a0acd820 changed the default
      slab allocator to SLUB, but the SLAB help text still says SLAB is the
      default. This change fixes that.
      Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
  7. 13 Nov, 2008 1 commit
  8. 12 Nov, 2008 2 commits
    • tracing/fastboot: Use the ring-buffer timestamp for initcall entries · 74239072
      Frederic Weisbecker committed
      Impact: Split the boot tracer entries in two parts: call and return
      
      Now that we are using the sched tracer from the boot tracer, we want
      to use the same timestamp as the ring buffer so that time captures are
      consistent between sched events and initcall events.
      
      So we get rid of the old time capture by the boot tracer and split the
      initcall events into two parts: call and return. This way both carry
      the ring-buffer timestamp.
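
The split can be pictured with a small user-space sketch: an initcall is logged as two separate events, and each is stamped by the shared buffer's clock when it is recorded rather than by the caller. The names here (boot_ev, log_event(), traced_initcall(), fake_clock_ns) are hypothetical and the clock is a plain counter standing in for the ring-buffer timestamp; the patch's actual structures may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum boot_ev_type { BOOT_EV_CALL, BOOT_EV_RET };

struct boot_ev {
	enum boot_ev_type type;
	uint64_t ts_ns;		/* stamped by the buffer, not by the caller */
	const char *func;
	int result;		/* meaningful for BOOT_EV_RET only */
};

#define LOG_CAP 64
static struct boot_ev log_buf[LOG_CAP];
static size_t log_len;
static uint64_t fake_clock_ns;	/* stand-in for the ring-buffer clock */

/* Append one event, taking the timestamp from the shared clock. */
static struct boot_ev *log_event(enum boot_ev_type type, const char *func,
				 int result)
{
	struct boot_ev *ev;

	assert(log_len < LOG_CAP);
	ev = &log_buf[log_len++];
	ev->type = type;
	ev->ts_ns = fake_clock_ns;
	ev->func = func;
	ev->result = result;
	return ev;
}

typedef int (*initcall_fn)(void);

/* Record a "call" event, run the initcall, then record a "return" event. */
static int traced_initcall(const char *name, initcall_fn fn)
{
	int ret;

	log_event(BOOT_EV_CALL, name, 0);
	ret = fn();
	log_event(BOOT_EV_RET, name, ret);
	return ret;
}

/* Demo initcall: pretends to take 280040 ns and succeeds. */
static int demo_init(void)
{
	fake_clock_ns += 280040;
	return 0;
}
```

Because both events share one clock with any other events written to the same buffer, they interleave in timestamp order, which is what the trace below shows for sched events occurring between a call and its return.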
      
      An example trace:
      
      [   27.904149584] calling  net_ns_init+0x0/0x1c0 @ 1
      [   27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs
      [   27.904575926] calling  reboot_init+0x0/0x20 @ 1
      [   27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs
      [   27.904800228] calling  sysctl_init+0x0/0x30 @ 1
      [   27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs
      [   27.905287211] calling  ksysfs_init+0x0/0xb0 @ 1
       ##### CPU 0 buffer started ####
                  init-1     [000]    27.905395:      1:120:R   + [001]    11:115:S
       ##### CPU 1 buffer started ####
                <idle>-0     [001]    27.905425:      0:140:R ==> [001]    11:115:R
                  init-1     [000]    27.905426:      1:120:D ==> [000]     0:140:R
                <idle>-0     [000]    27.905431:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905451:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905456:      4:115:S ==> [000]     0:140:R
                 udevd-11    [001]    27.905458:     11:115:R   + [001]    14:115:R
                <idle>-0     [000]    27.905459:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905462:      0:140:R ==> [000]     4:115:R
                 udevd-11    [001]    27.905462:     11:115:R ==> [001]    14:115:R
           ksoftirqd/0-4     [000]    27.905467:      4:115:S ==> [000]     0:140:R
                <idle>-0     [000]    27.905470:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905473:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905476:      4:115:S ==> [000]     0:140:R
                <idle>-0     [000]    27.905479:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905482:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905486:      4:115:S ==> [000]     0:140:R
                 udevd-14    [001]    27.905499:     14:120:X ==> [001]    11:115:R
                 udevd-11    [001]    27.905506:     11:115:R   + [000]     1:120:D
                <idle>-0     [000]    27.905515:      0:140:R ==> [000]     1:120:R
                 udevd-11    [001]    27.905517:     11:115:S ==> [001]     0:140:R
      [   27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs
      [   27.905705736] calling  init_jiffies_clocksource+0x0/0x10 @ 1
      [   27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs
      [   27.906769814] calling  pm_init+0x0/0x30 @ 1
      [   27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs
      [   27.906997803] calling  pm_disk_init+0x0/0x20 @ 1
      [   27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs
      [   27.907222556] calling  swsusp_header_init+0x0/0x30 @ 1
      [   27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs
      [   27.907439620] calling  stop_machine_init+0x0/0x50 @ 1
                  init-1     [000]    27.907485:      1:120:R   + [000]     2:115:S
                  init-1     [000]    27.907490:      1:120:D ==> [000]     2:115:R
              kthreadd-2     [000]    27.907507:      2:115:R   + [001]    15:115:R
                <idle>-0     [001]    27.907517:      0:140:R ==> [001]    15:115:R
              kthreadd-2     [000]    27.907517:      2:115:D ==> [000]     0:140:R
                <idle>-0     [000]    27.907521:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.907524:      0:140:R ==> [000]     4:115:R
                 udevd-15    [001]    27.907527:     15:115:D   + [000]     2:115:D
           ksoftirqd/0-4     [000]    27.907537:      4:115:S ==> [000]     2:115:R
                 udevd-15    [001]    27.907537:     15:115:D ==> [001]     0:140:R
              kthreadd-2     [000]    27.907546:      2:115:R   + [000]     1:120:D
              kthreadd-2     [000]    27.907550:      2:115:S ==> [000]     1:120:R
                  init-1     [000]    27.907584:      1:120:R   + [000]    15:  0:D
                  init-1     [000]    27.907589:      1:120:R   + [000]     2:115:S
                  init-1     [000]    27.907593:      1:120:D ==> [000]    15:  0:R
                 udevd-15    [000]    27.907601:     15:  0:S ==> [000]     2:115:R
       ##### CPU 0 buffer started ####
              kthreadd-2     [000]    27.907616:      2:115:R   + [001]    16:115:R
       ##### CPU 1 buffer started ####
                <idle>-0     [001]    27.907620:      0:140:R ==> [001]    16:115:R
              kthreadd-2     [000]    27.907621:      2:115:D ==> [000]     0:140:R
                 udevd-16    [001]    27.907625:     16:115:D   + [000]     2:115:D
                <idle>-0     [000]    27.907628:      0:140:R   + [000]     4:115:S
                 udevd-16    [001]    27.907629:     16:115:D ==> [001]     0:140:R
                <idle>-0     [000]    27.907631:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.907636:      4:115:S ==> [000]     2:115:R
              kthreadd-2     [000]    27.907644:      2:115:R   + [000]     1:120:D
              kthreadd-2     [000]    27.907647:      2:115:S ==> [000]     1:120:R
                  init-1     [000]    27.907657:      1:120:R   + [001]    16:  0:D
                <idle>-0     [001]    27.907666:      0:140:R ==> [001]    16:  0:R
      [   27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs
      [   27.907850704] calling  filelock_init+0x0/0x30 @ 1
      [   27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs
      [   27.908071327] calling  init_script_binfmt+0x0/0x10 @ 1
      [   27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs
      [   27.908309461] calling  init_elf_binfmt+0x0/0x10 @ 1
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • tracing/fastboot: move boot tracer structs and funcs into their own header. · 3f5ec136
      Frederic Weisbecker committed
      Impact: Cleanups on the boot tracer and ftrace
      
      This patch cleans up the boot tracer headers. The functions and
      structures of this tracer are unrelated to ftrace and so should have
      their own header file.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 05 Nov, 2008 1 commit
    • tracing/fastboot: Enable boot tracing only during initcalls · 71566a0d
      Frederic Weisbecker committed
      Impact: modify boot tracer
      
      We used to disable the initcall tracing at a specified time (i.e., the
      end of built-in initcalls), but that is no longer needed: tracing is
      stopped when the initcalls are finished.
      
      However we want two things:
      
      _Start this tracing only after pre-smp initcalls are finished.
      
      _Since we are planning to trace sched_switches at the same time, we
      want to enable them only during the initcall execution.
      
      For this purpose, this patch introduce two functions to enable/disable
      the sched_switch tracing during boot.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      71566a0d
  10. 02 Nov, 2008 1 commit
  11. 31 Oct, 2008 2 commits
  12. 26 Oct, 2008 1 commit
    • L
      Revert "Call init_workqueues before pre smp initcalls." · 4403b406
      Linus Torvalds committed
      This reverts commit a802dd0e by moving
      the call to init_workqueues() back where it belongs - after SMP has been
      initialized.
      
      It also moves stop_machine_init() - which needs workqueues - to a later
      phase using a core_initcall() instead of early_initcall().  That should
      satisfy all ordering requirements, and was apparently the reason why
      init_workqueues() was moved to be too early.
      
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4403b406
  13. 23 Oct, 2008 3 commits
  14. 22 Oct, 2008 2 commits
  15. 21 Oct, 2008 1 commit
    • T
      PCI: allow quirks to be compiled out · 3d137310
      Thomas Petazzoni committed
      This patch adds the CONFIG_PCI_QUIRKS option, which allows removing all
      the PCI quirks, which are not necessarily used on embedded systems when
      PCI is working properly. As this is a size-reduction option, it depends
      on CONFIG_EMBEDDED. It saves almost 12 kilobytes of kernel
      code:
      
         text	   data	    bss	    dec	    hex	filename
      1287806	 123596	 212992	1624394	 18c94a	vmlinux.old
      1275854	 123596	 212992	1612442	 189a9a	vmlinux
       -11952       0       0  -11952   -2EB0 +/-
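
      The +/- row can be re-derived from the two rows above it. A minimal
      sketch in Python, with the numbers copied from the table; the helper
      name size_delta is made up for illustration and is not part of any
      kernel tooling:

```python
# Byte-count columns printed by `size`, in table order.
FIELDS = ("text", "data", "bss", "dec")

def size_delta(old, new):
    """Return new - old for each column, keyed by column name (hypothetical helper)."""
    return {name: n - o for name, o, n in zip(FIELDS, old, new)}

# Rows copied from the table above: vmlinux.old first, then vmlinux.
old = (1287806, 123596, 212992, 1624394)
new = (1275854, 123596, 212992, 1612442)

delta = size_delta(old, new)
print(delta)                 # per-column difference in bytes
print(hex(-delta["dec"]))    # size of the total saving, in hex
```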
      
      This patch was originally written by Zwane Mwaikambo
      <zwane@arm.linux.org.uk> and is part of the Linux Tiny project.
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
      3d137310
  16. 20 Oct, 2008 2 commits
    • M
      container freezer: implement freezer cgroup subsystem · dc52ddc0
      Matt Helsley committed
      This patch implements a new freezer subsystem in the control groups
      framework.  It provides a way to stop and resume execution of all tasks in
      a cgroup by writing in the cgroup filesystem.
      
      The freezer subsystem in the container filesystem defines a file named
      freezer.state.  Writing "FROZEN" to the state file will freeze all tasks
      in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
      the cgroup.  Reading will return the current state.
      
      * Examples of usage :
      
         # mkdir /containers/freezer
         # mount -t cgroup -ofreezer freezer  /containers
         # mkdir /containers/0
         # echo $some_pid > /containers/0/tasks
      
      to get status of the freezer subsystem :
      
         # cat /containers/0/freezer.state
         RUNNING
      
      to freeze all tasks in the container :
      
         # echo FROZEN > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         FREEZING
         # cat /containers/0/freezer.state
         FROZEN
      
      to unfreeze all tasks in the container :
      
         # echo RUNNING > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         RUNNING
      
      This is the basic mechanism which should do the right thing for user space
      tasks in a simple scenario.
      
      It's important to note that freezing can be incomplete.  In that case we
      return EBUSY.  This means that some tasks in the cgroup are busy doing
      something that prevents us from completely freezing the cgroup at this
      time.  After EBUSY, the cgroup will remain partially frozen -- reflected
      by freezer.state reporting "FREEZING" when read.  The state will remain
      "FREEZING" until one of these things happens:
      
      	1) Userspace cancels the freezing operation by writing "RUNNING" to
      		the freezer.state file
      	2) Userspace retries the freezing operation by writing "FROZEN" to
      		the freezer.state file (writing "FREEZING" is not legal
      		and returns EIO)
      	3) The tasks that blocked the cgroup from entering the "FROZEN"
      		state disappear from the cgroup's set of tasks.
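
      The retry behaviour described above can be sketched from user space.
      This is only an illustration in Python: it assumes the cgroup directory
      layout from the usage examples (a freezer.state file inside the cgroup
      directory), and the helper names read_state, write_state, freeze and
      thaw are hypothetical, not an existing library API:

```python
import errno
import os
import time

def read_state(cgroup_dir):
    """Return the current contents of freezer.state, e.g. "FROZEN"."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()

def write_state(cgroup_dir, state):
    """Write "FROZEN" or "RUNNING" to freezer.state."""
    with open(os.path.join(cgroup_dir, "freezer.state"), "w") as f:
        f.write(state)

def freeze(cgroup_dir, attempts=5, delay=0.1):
    """Retry the freeze until freezer.state reads FROZEN.

    Returns True on success, False if the cgroup is still not frozen
    after `attempts` tries (assumption: the caller then decides whether
    to cancel with thaw()).
    """
    for _ in range(attempts):
        try:
            write_state(cgroup_dir, "FROZEN")
        except OSError as e:
            # EBUSY means the freeze is incomplete; anything else is fatal.
            if e.errno != errno.EBUSY:
                raise
        if read_state(cgroup_dir) == "FROZEN":
            return True
        time.sleep(delay)
    return False

def thaw(cgroup_dir):
    """Cancel a pending freeze or unfreeze the tasks."""
    write_state(cgroup_dir, "RUNNING")
```

      With the mount point from the examples, freeze("/containers/0") would
      keep retrying while the state reads "FREEZING", and thaw("/containers/0")
      cancels the operation.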
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: export thaw_process]
      Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Tested-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc52ddc0
    • N
      mm: rewrite vmap layer · db64fe02
      Nick Piggin committed
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
      This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
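
      The idea of deferring the flush can be shown with a toy model. This is
      a sketch in Python, not the kernel's data structures: the class name
      LazyVmapModel and its fields are invented for illustration, and the
      real code decides when to flush by its own rules rather than only when
      the free pool runs dry:

```python
class LazyVmapModel:
    """Toy model: unmapped ranges are parked until one batched flush."""

    def __init__(self):
        self.free = []      # ranges that may be handed out again
        self.lazy = []      # unmapped, but stale TLB entries may remain
        self.flushes = 0    # number of (expensive) global TLB flushes

    def vunmap(self, addr_range):
        # No flush here: remember the range and keep it unavailable.
        self.lazy.append(addr_range)

    def flush(self):
        # One flush covers every range unmapped since the previous flush.
        if self.lazy:
            self.flushes += 1
            self.free.extend(self.lazy)
            self.lazy.clear()

    def alloc(self):
        # Only ranges that have already been flushed may be reused.
        if not self.free:
            self.flush()
        return self.free.pop() if self.free else None


model = LazyVmapModel()
for r in [(0, 4), (4, 8), (8, 12)]:
    model.vunmap(r)
first = model.alloc()
print(model.flushes)  # global flushes needed for three vunmaps in this model
```

      In the model, three unmaps cost a single flush, whereas flushing on
      every vunmap would cost three.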
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on a 4 core, 2 socket opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with 8 cores, the rewritten version is already 25x faster.
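
      The ratios behind that figure can be recomputed from the table. A small
      Python sketch, with the numbers copied from the table above (the
      function name speedups is illustrative only); at 8 threads the ratio
      works out to a little over 24, in line with the rounded figure quoted:

```python
# (threads, vanilla ns, rewrite ns) copied from the table above.
RESULTS = [
    (1, 14700, 2900),
    (2, 33600, 3000),
    (4, 49500, 2800),
    (8, 70631, 2900),
]

def speedups(rows):
    """Map thread count to the vanilla/rewrite time ratio."""
    return {threads: vanilla / rewrite for threads, vanilla, rewrite in rows}

for threads, ratio in speedups(RESULTS).items():
    print(f"{threads} threads: {ratio:.1f}x")
```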
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      db64fe02
  17. 17 Oct, 2008 4 commits
    • A
      init/do_mounts_md.c must #include <linux/delay.h> · 73b4a24f
      Adrian Bunk committed
      This patch fixes the following compile error caused by commit
      589f800b ("fastboot: make the raid
      autodetect code wait for all devices to init"):
      
          CC      init/do_mounts_md.o
        init/do_mounts_md.c: In function 'autodetect_raid':
        init/do_mounts_md.c:285: error: implicit declaration of function 'msleep'
        make[2]: *** [init/do_mounts_md.o] Error 1
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73b4a24f
    • T
      Configure out AIO support · ebf3f09c
      Thomas Petazzoni committed
      This patch adds the CONFIG_AIO option, which allows removing support
      for asynchronous I/O operations, which are not necessarily used by
      applications, particularly on embedded devices. As this is a
      size-reduction option, it depends on CONFIG_EMBEDDED. It saves
      ~7 kilobytes of kernel code/data:
      
         text	   data	    bss	    dec	    hex	filename
      1115067	 119180	 217088	1451335	 162547	vmlinux
      1108025	 119048	 217088	1444161	 160941	vmlinux.new
        -7042    -132       0   -7174   -1C06 +/-
      
      This patch was originally written by Matt Mackall
      <mpm@selenic.com>, and is part of the Linux Tiny project.
      
      [randy.dunlap@oracle.com: build fix]
      Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Cc: Benjamin LaHaise <bcrl@kvack.org>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebf3f09c
    • N
      initramfs: add option to preserve mtime from initramfs cpio images · 889d51a1
      Nye Liu committed
      When unpacking the cpio into the initramfs, mtimes are not preserved by
      default.  This patch adds an INITRAMFS_PRESERVE_MTIME option that allows
      mtimes stored in the cpio image to be used when constructing the
      initramfs.
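
      For reference, the mtime in question is one of the fixed-width fields of
      each "newc" cpio entry header. The sketch below, in Python, only shows
      where that value sits in the archive; it does not reproduce how the
      kernel applies it, and the function name parse_newc_header is made up
      for illustration:

```python
NEWC_MAGIC = b"070701"
# Field order of the 110-byte "newc" header; each field is 8 ASCII hex digits.
NEWC_FIELDS = ("ino", "mode", "uid", "gid", "nlink", "mtime", "filesize",
               "devmajor", "devminor", "rdevmajor", "rdevminor",
               "namesize", "check")

def parse_newc_header(header):
    """Decode one newc cpio header (bytes) into a dict of integers."""
    if len(header) < 110 or header[:6] != NEWC_MAGIC:
        raise ValueError("not a newc cpio header")
    values = {}
    for i, name in enumerate(NEWC_FIELDS):
        start = 6 + 8 * i
        values[name] = int(header[start:start + 8], 16)
    return values
```

      parse_newc_header(data[:110])["mtime"] then gives the modification time
      of the first archive member in seconds since the epoch.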
      
      For embedded applications that run exclusively out of the initramfs, this
      is invaluable:
      
      When building embedded application initramfs images, it's nice to know when
      the files were actually created during the build process - that makes it
      easier to see what files were modified when so we can compare the files
      that are being used on the image with the files used during the build
      process.  This might help (for example) to determine if the target system
      has all the updated files you expect to see w/o having to check MD5s etc.
      
      In our environment, the whole system runs off the initramfs partition, and
      seeing the modified times of the shared libraries (for example) helps us
      find bugs that may have been introduced by the build system incorrectly
      propagating outdated shared libraries into the image.
      
      Similarly, many of the initialization/configuration files in /etc might be
      dynamically built by the build system, and knowing when they were modified
      helps us sanity check whether the target system has the "latest" files
      etc.
      
      Finally, we might use last modified times to determine whether a hot fix
      should be applied or not to the running ramfs.
      Signed-off-by: Nye Liu <nyet@nyet.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      889d51a1
    • G
      identify_ramdisk_image(): correct typo about return value in comment · 93fd85d0
      Geert Uytterhoeven committed
      identify_ramdisk_image() returns 0 (not -1) if a gzipped ramdisk is found:
      
      	if (buf[0] == 037 && ((buf[1] == 0213) || (buf[1] == 0236))) {
      		printk(KERN_NOTICE
      		       "RAMDISK: Compressed image found at block %d\n",
      		       start_block);
      		nblocks = 0;
      		^^^^^^^^^^^
      		goto done;
      	}
      
      	...
      
      done:
      	sys_lseek(fd, start_block * BLOCK_SIZE, 0);
      	kfree(buf);
      	return nblocks;
      	^^^^^^^^^^^^^^
      
      Hence correct the typo in the comment, which has existed since the
      addition of compressed ramdisk support in 1.3.48.
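
      The magic-number test in the quoted snippet can be tried from user
      space. A minimal Python sketch of the same two-byte comparison; the
      function name looks_gzipped is illustrative and is not the kernel
      function:

```python
import gzip

def looks_gzipped(buf):
    """Same check as the quoted C: octal 037, then 0213 or 0236."""
    return len(buf) >= 2 and buf[0] == 0o37 and buf[1] in (0o213, 0o236)

# A buffer produced by Python's gzip module starts with the 037 0213 pair.
print(looks_gzipped(gzip.compress(b"ramdisk")))
```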
      Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93fd85d0
  18. 16 Oct, 2008 10 commits
  19. 14 Oct, 2008 3 commits