1. 26 2月, 2009 1 次提交
    • P
      rcu: Teach RCU that idle task is not quiscent state at boot · a6826048
      Paul E. McKenney 提交于
      This patch fixes a bug located by Vegard Nossum with the aid of
      kmemcheck, updated based on review comments from Nick Piggin,
      Ingo Molnar, and Andrew Morton.  And cleans up the variable-name
      and function-name language.  ;-)
      
      The boot CPU runs in the context of its idle thread during boot-up.
      During this time, idle_cpu(0) will always return nonzero, which will
      fool Classic and Hierarchical RCU into deciding that a large chunk of
      the boot-up sequence is a big long quiescent state.  This in turn causes
      RCU to prematurely end grace periods during this time.
      
      This patch changes the rcutree.c and rcuclassic.c rcu_check_callbacks()
      function to ignore the idle task as a quiescent state until the
      system has started up the scheduler in rest_init(), introducing a
      new non-API function rcu_idle_now_means_idle() to inform RCU of this
      transition.  RCU maintains an internal rcu_idle_cpu_truthful variable
      to track this state, which is then used by rcu_check_callback() to
      determine if it should believe idle_cpu().
      
      Because this patch has the effect of disallowing RCU grace periods
      during long stretches of the boot-up sequence, this patch also introduces
      Josh Triplett's UP-only optimization that makes synchronize_rcu() be a
      no-op if num_online_cpus() returns 1.  This allows boot-time code that
      calls synchronize_rcu() to proceed normally.  Note, however, that RCU
      callbacks registered by call_rcu() will likely queue up until later in
      the boot sequence.  Although rcuclassic and rcutree can also use this
      same optimization after boot completes, rcupreempt must restrict its
      use of this optimization to the portion of the boot sequence before the
      scheduler starts up, given that an rcupreempt RCU read-side critical
      section may be preeempted.
      
      In addition, this patch takes Nick Piggin's suggestion to make the
      system_state global variable be __read_mostly.
      
      Changes since v4:
      
      o	Changes the name of the introduced function and variable to
      	be less emotional.  ;-)
      
      Changes since v3:
      
      o	WARN_ON(nr_context_switches() > 0) to verify that RCU
      	switches out of boot-time mode before the first context
      	switch, as suggested by Nick Piggin.
      
      Changes since v2:
      
      o	Created rcu_blocking_is_gp() internal-to-RCU API that
      	determines whether a call to synchronize_rcu() is itself
      	a grace period.
      
      o	The definition of rcu_blocking_is_gp() for rcuclassic and
      	rcutree checks to see if but a single CPU is online.
      
      o	The definition of rcu_blocking_is_gp() for rcupreempt
      	checks to see both if but a single CPU is online and if
      	the system is still in early boot.
      
      	This allows rcupreempt to again work correctly if running
      	on a single CPU after booting is complete.
      
      o	Added check to rcupreempt's synchronize_sched() for there
      	being but one online CPU.
      
      Tested all three variants both SMP and !SMP, booted fine, passed a short
      rcutorture test on both x86 and Power.
      Located-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NVegard Nossum <vegard.nossum@gmail.com>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a6826048
  2. 08 1月, 2009 1 次提交
    • A
      async: Asynchronous function calls to speed up kernel boot · 22a9d645
      Arjan van de Ven 提交于
      Right now, most of the kernel boot is strictly synchronous, such that
      various hardware delays are done sequentially.
      
      In order to make the kernel boot faster, this patch introduces
      infrastructure to allow doing some of the initialization steps
      asynchronously, which will hide significant portions of the hardware delays
      in practice.
      
      In order to not change device order and other similar observables, this
      patch does NOT do full parallel initialization.
      
      Rather, it operates more in the way an out of order CPU does; the work may
      be done out of order and asynchronous, but the observable effects
      (instruction retiring for the CPU) are still done in the original sequence.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      22a9d645
  3. 07 1月, 2009 3 次提交
  4. 06 1月, 2009 1 次提交
  5. 03 1月, 2009 1 次提交
  6. 01 1月, 2009 2 次提交
  7. 29 12月, 2008 1 次提交
  8. 27 12月, 2008 1 次提交
  9. 08 12月, 2008 1 次提交
    • Y
      sparse irq_desc[] array: core kernel and x86 changes · 0b8f1efa
      Yinghai Lu 提交于
      Impact: new feature
      
      Problem on distro kernels: irq_desc[NR_IRQS] takes megabytes of RAM with
      NR_CPUS set to large values. The goal is to be able to scale up to much
      larger NR_IRQS value without impacting the (important) common case.
      
      To solve this, we generalize irq_desc[NR_IRQS] to an (optional) array of
      irq_desc pointers.
      
      When CONFIG_SPARSE_IRQ=y is used, we use kzalloc_node to get irq_desc,
      this also makes the IRQ descriptors NUMA-local (to the site that calls
      request_irq()).
      
      This gets rid of the irq_cfg[] static array on x86 as well: irq_cfg now
      uses desc->chip_data for x86 to store irq_cfg.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0b8f1efa
  10. 23 11月, 2008 1 次提交
  11. 14 11月, 2008 1 次提交
    • D
      CRED: Inaugurate COW credentials · d84f4f99
      David Howells 提交于
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
      	 NOTE!  This results in the loss of some state: SELinux's osid no
      	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d84f4f99
  12. 12 11月, 2008 2 次提交
    • F
      tracing/fastboot: Use the ring-buffer timestamp for initcall entries · 74239072
      Frederic Weisbecker 提交于
      Impact: Split the boot tracer entries in two parts: call and return
      
      Now that we are using the sched tracer from the boot tracer, we want
      to use the same timestamp than the ring-buffer to have consistent time
      captures between sched events and initcall events.
      
      So we get rid of the old time capture by the boot tracer and split the
      initcall events in two parts: call and return. This way we have the
      ring buffer timestamp of both.
      
      An example trace:
      
      [   27.904149584] calling  net_ns_init+0x0/0x1c0 @ 1
      [   27.904429624] initcall net_ns_init+0x0/0x1c0 returned 0 after 0 msecs
      [   27.904575926] calling  reboot_init+0x0/0x20 @ 1
      [   27.904655399] initcall reboot_init+0x0/0x20 returned 0 after 0 msecs
      [   27.904800228] calling  sysctl_init+0x0/0x30 @ 1
      [   27.905142914] initcall sysctl_init+0x0/0x30 returned 0 after 0 msecs
      [   27.905287211] calling  ksysfs_init+0x0/0xb0 @ 1
       ##### CPU 0 buffer started ####
                  init-1     [000]    27.905395:      1:120:R   + [001]    11:115:S
       ##### CPU 1 buffer started ####
                <idle>-0     [001]    27.905425:      0:140:R ==> [001]    11:115:R
                  init-1     [000]    27.905426:      1:120:D ==> [000]     0:140:R
                <idle>-0     [000]    27.905431:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905451:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905456:      4:115:S ==> [000]     0:140:R
                 udevd-11    [001]    27.905458:     11:115:R   + [001]    14:115:R
                <idle>-0     [000]    27.905459:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905462:      0:140:R ==> [000]     4:115:R
                 udevd-11    [001]    27.905462:     11:115:R ==> [001]    14:115:R
           ksoftirqd/0-4     [000]    27.905467:      4:115:S ==> [000]     0:140:R
                <idle>-0     [000]    27.905470:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905473:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905476:      4:115:S ==> [000]     0:140:R
                <idle>-0     [000]    27.905479:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.905482:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.905486:      4:115:S ==> [000]     0:140:R
                 udevd-14    [001]    27.905499:     14:120:X ==> [001]    11:115:R
                 udevd-11    [001]    27.905506:     11:115:R   + [000]     1:120:D
                <idle>-0     [000]    27.905515:      0:140:R ==> [000]     1:120:R
                 udevd-11    [001]    27.905517:     11:115:S ==> [001]     0:140:R
      [   27.905557107] initcall ksysfs_init+0x0/0xb0 returned 0 after 3906 msecs
      [   27.905705736] calling  init_jiffies_clocksource+0x0/0x10 @ 1
      [   27.905779239] initcall init_jiffies_clocksource+0x0/0x10 returned 0 after 0 msecs
      [   27.906769814] calling  pm_init+0x0/0x30 @ 1
      [   27.906853627] initcall pm_init+0x0/0x30 returned 0 after 0 msecs
      [   27.906997803] calling  pm_disk_init+0x0/0x20 @ 1
      [   27.907076946] initcall pm_disk_init+0x0/0x20 returned 0 after 0 msecs
      [   27.907222556] calling  swsusp_header_init+0x0/0x30 @ 1
      [   27.907294325] initcall swsusp_header_init+0x0/0x30 returned 0 after 0 msecs
      [   27.907439620] calling  stop_machine_init+0x0/0x50 @ 1
                  init-1     [000]    27.907485:      1:120:R   + [000]     2:115:S
                  init-1     [000]    27.907490:      1:120:D ==> [000]     2:115:R
              kthreadd-2     [000]    27.907507:      2:115:R   + [001]    15:115:R
                <idle>-0     [001]    27.907517:      0:140:R ==> [001]    15:115:R
              kthreadd-2     [000]    27.907517:      2:115:D ==> [000]     0:140:R
                <idle>-0     [000]    27.907521:      0:140:R   + [000]     4:115:S
                <idle>-0     [000]    27.907524:      0:140:R ==> [000]     4:115:R
                 udevd-15    [001]    27.907527:     15:115:D   + [000]     2:115:D
           ksoftirqd/0-4     [000]    27.907537:      4:115:S ==> [000]     2:115:R
                 udevd-15    [001]    27.907537:     15:115:D ==> [001]     0:140:R
              kthreadd-2     [000]    27.907546:      2:115:R   + [000]     1:120:D
              kthreadd-2     [000]    27.907550:      2:115:S ==> [000]     1:120:R
                  init-1     [000]    27.907584:      1:120:R   + [000]    15:  0:D
                  init-1     [000]    27.907589:      1:120:R   + [000]     2:115:S
                  init-1     [000]    27.907593:      1:120:D ==> [000]    15:  0:R
                 udevd-15    [000]    27.907601:     15:  0:S ==> [000]     2:115:R
       ##### CPU 0 buffer started ####
              kthreadd-2     [000]    27.907616:      2:115:R   + [001]    16:115:R
       ##### CPU 1 buffer started ####
                <idle>-0     [001]    27.907620:      0:140:R ==> [001]    16:115:R
              kthreadd-2     [000]    27.907621:      2:115:D ==> [000]     0:140:R
                 udevd-16    [001]    27.907625:     16:115:D   + [000]     2:115:D
                <idle>-0     [000]    27.907628:      0:140:R   + [000]     4:115:S
                 udevd-16    [001]    27.907629:     16:115:D ==> [001]     0:140:R
                <idle>-0     [000]    27.907631:      0:140:R ==> [000]     4:115:R
           ksoftirqd/0-4     [000]    27.907636:      4:115:S ==> [000]     2:115:R
              kthreadd-2     [000]    27.907644:      2:115:R   + [000]     1:120:D
              kthreadd-2     [000]    27.907647:      2:115:S ==> [000]     1:120:R
                  init-1     [000]    27.907657:      1:120:R   + [001]    16:  0:D
                <idle>-0     [001]    27.907666:      0:140:R ==> [001]    16:  0:R
      [   27.907703862] initcall stop_machine_init+0x0/0x50 returned 0 after 0 msecs
      [   27.907850704] calling  filelock_init+0x0/0x30 @ 1
      [   27.907926573] initcall filelock_init+0x0/0x30 returned 0 after 0 msecs
      [   27.908071327] calling  init_script_binfmt+0x0/0x10 @ 1
      [   27.908165195] initcall init_script_binfmt+0x0/0x10 returned 0 after 0 msecs
      [   27.908309461] calling  init_elf_binfmt+0x0/0x10 @ 1
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      74239072
    • F
      tracing/fastboot: move boot tracer structs and funcs into their own header. · 3f5ec136
      Frederic Weisbecker 提交于
      Impact: Cleanups on the boot tracer and ftrace
      
      This patch bring some cleanups about the boot tracer headers. The
      functions and structures of this tracer have nothing related to ftrace
      and should have so their own header file.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3f5ec136
  13. 05 11月, 2008 1 次提交
    • F
      tracing/fastboot: Enable boot tracing only during initcalls · 71566a0d
      Frederic Weisbecker 提交于
      Impact: modify boot tracer
      
      We used to disable the initcall tracing at a specified time (IE: end
      of builtin initcalls). But we don't need it anymore. It will be
      stopped when initcalls are finished.
      
      However we want two things:
      
      _Start this tracing only after pre-smp initcalls are finished.
      
      _Since we are planning to trace sched_switches at the same time, we
      want to enable them only during the initcall execution.
      
      For this purpose, this patch introduce two functions to enable/disable
      the sched_switch tracing during boot.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      71566a0d
  14. 26 10月, 2008 1 次提交
    • L
      Revert "Call init_workqueues before pre smp initcalls." · 4403b406
      Linus Torvalds 提交于
      This reverts commit a802dd0e by moving
      the call to init_workqueues() back where it belongs - after SMP has been
      initialized.
      
      It also moves stop_machine_init() - which needs workqueues - to a later
      phase using a core_initcall() instead of early_initcall().  That should
      satisfy all ordering requirements, and was apparently the reason why
      init_workqueues() was moved to be too early.
      
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4403b406
  15. 23 10月, 2008 2 次提交
  16. 22 10月, 2008 2 次提交
  17. 20 10月, 2008 1 次提交
    • N
      mm: rewrite vmap layer · db64fe02
      Nick Piggin 提交于
      Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
      provide a fast, scalable percpu frontend for small vmaps (requires a
      slightly different API, though).
      
      The biggest problem with vmap is actually vunmap.  Presently this requires
      a global kernel TLB flush, which on most architectures is a broadcast IPI
      to all CPUs to flush the cache.  This is all done under a global lock.  As
      the number of CPUs increases, so will the number of vunmaps a scaled
      workload will want to perform, and so will the cost of a global TLB flush.
       This gives terrible quadratic scalability characteristics.
      
      Another problem is that the entire vmap subsystem works under a single
      lock.  It is a rwlock, but it is actually taken for write in all the fast
      paths, and the read locking would likely never be run concurrently anyway,
      so it's just pointless.
      
      This is a rewrite of vmap subsystem to solve those problems.  The existing
      vmalloc API is implemented on top of the rewritten subsystem.
      
      The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
      addresses do not have to be flushed immediately when they are vunmapped,
      because the kernel will not reuse them again (would be a use-after-free)
      until they are reallocated.  So the addresses aren't allocated again until
      a subsequent TLB flush.  A single TLB flush then can flush multiple
      vunmaps from each CPU.
      
      XEN and PAT and such do not like deferred TLB flushing because they can't
      always handle multiple aliasing virtual addresses to a physical address.
      They now call vm_unmap_aliases() in order to flush any deferred mappings.
      That call is very expensive (well, actually not a lot more expensive than
      a single vunmap under the old scheme), however it should be OK if not
      called too often.
      
      The virtual memory extent information is stored in an rbtree rather than a
      linked list to improve the algorithmic scalability.
      
      There is a per-CPU allocator for small vmaps, which amortizes or avoids
      global locking.
      
      To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
      must be used in place of vmap and vunmap.  Vmalloc does not use these
      interfaces at the moment, so it will not be quite so scalable (although it
      will use lazy TLB flushing).
      
      As a quick test of performance, I ran a test that loops in the kernel,
      linearly mapping then touching then unmapping 4 pages.  Different numbers
      of tests were run in parallel on an 4 core, 2 socket opteron.  Results are
      in nanoseconds per map+touch+unmap.
      
      threads           vanilla         vmap rewrite
      1                 14700           2900
      2                 33600           3000
      4                 49500           2800
      8                 70631           2900
      
      So with a 8 cores, the rewritten version is already 25x faster.
      
      In a slightly more realistic test (although with an older and less
      scalable version of the patch), I ripped the not-very-good vunmap batching
      code out of XFS, and implemented the large buffer mapping with vm_map_ram
      and vm_unmap_ram...  along with a couple of other tricks, I was able to
      speed up a large directory workload by 20x on a 64 CPU system.  I believe
      vmap/vunmap is actually sped up a lot more than 20x on such a system, but
      I'm running into other locks now.  vmap is pretty well blown off the
      profiles.
      
      Before:
      1352059 total                                      0.1401
      798784 _write_lock                              8320.6667 <- vmlist_lock
      529313 default_idle                             1181.5022
       15242 smp_call_function                         15.8771  <- vmap tlb flushing
        2472 __get_vm_area_node                         1.9312  <- vmap
        1762 remove_vm_area                             4.5885  <- vunmap
         316 map_vm_area                                0.2297  <- vmap
         312 kfree                                      0.1950
         300 _spin_lock                                 3.1250
         252 sn_send_IPI_phys                           0.4375  <- tlb flushing
         238 vmap                                       0.8264  <- vmap
         216 find_lock_page                             0.5192
         196 find_next_bit                              0.3603
         136 sn2_send_IPI                               0.2024
         130 pio_phys_write_mmr                         2.0312
         118 unmap_kernel_range                         0.1229
      
      After:
       78406 total                                      0.0081
       40053 default_idle                              89.4040
       33576 ia64_spinlock_contention                 349.7500
        1650 _spin_lock                                17.1875
         319 __reg_op                                   0.5538
         281 _atomic_dec_and_lock                       1.0977
         153 mutex_unlock                               1.5938
         123 iget_locked                                0.1671
         117 xfs_dir_lookup                             0.1662
         117 dput                                       0.1406
         114 xfs_iget_core                              0.0268
          92 xfs_da_hashname                            0.1917
          75 d_alloc                                    0.0670
          68 vmap_page_range                            0.0462 <- vmap
          58 kmem_cache_alloc                           0.0604
          57 memset                                     0.0540
          52 rb_next                                    0.1625
          50 __copy_user                                0.0208
          49 bitmap_find_free_region                    0.2188 <- vmap
          46 ia64_sn_udelay                             0.1106
          45 find_inode_fast                            0.1406
          42 memcmp                                     0.2188
          42 finish_task_switch                         0.1094
          42 __d_lookup                                 0.0410
          40 radix_tree_lookup_slot                     0.1250
          37 _spin_unlock_irqrestore                    0.3854
          36 xfs_bmapi                                  0.0050
          36 kmem_cache_free                            0.0256
          35 xfs_vn_getattr                             0.0322
          34 radix_tree_lookup                          0.1062
          33 __link_path_walk                           0.0035
          31 xfs_da_do_buf                              0.0091
          30 _xfs_buf_find                              0.0204
          28 find_get_page                              0.0875
          27 xfs_iread                                  0.0241
          27 __strncpy_from_user                        0.2812
          26 _xfs_buf_initialize                        0.0406
          24 _xfs_buf_lookup_pages                      0.0179
          24 vunmap_page_range                          0.0250 <- vunmap
          23 find_lock_page                             0.0799
          22 vm_map_ram                                 0.0087 <- vmap
          20 kfree                                      0.0125
          19 put_page                                   0.0330
          18 __kmalloc                                  0.0176
          17 xfs_da_node_lookup_int                     0.0086
          17 _read_lock                                 0.0885
          17 page_waitqueue                             0.0664
      
      vmap has gone from being the top 5 on the profiles and flushing the crap
      out of all TLBs, to using less than 1% of kernel time.
      
      [akpm@linux-foundation.org: cleanups, section fix]
      [akpm@linux-foundation.org: fix build on alpha]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db64fe02
  18. 16 10月, 2008 6 次提交
  19. 14 10月, 2008 7 次提交
  20. 12 10月, 2008 1 次提交
    • A
      Add a script to visualize the kernel boot process / time · f9b9796a
      Arjan van de Ven 提交于
      When optimizing the kernel boot time, it's very valuable to visualize
      what is going on at which time. In addition, with some of the initializing
      going asynchronous soon, it's valuable to track/print which worker thread
      is executing the initialization.
      
      This patch adds a script to turn a dmesg into a SVG graph (that can be
      shown with tools such as InkScape, Gimp or Firefox) and a small change
      to the initcall code to print the PID of the thread calling the initcall
      (so that the script can work out the parallelism).
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      f9b9796a
  21. 04 10月, 2008 1 次提交
  22. 12 8月, 2008 1 次提交
    • A
      modules: extend initcall_debug functionality to the module loader · 59f9415f
      Arjan van de Ven 提交于
      The kernel has this really nice facility where if you put "initcall_debug"
      on the kernel commandline, it'll print which function it's going to
      execute just before calling an initcall, and then after the call completes
      it will
      
      1) print if it had an error code
      
      2) checks for a few simple bugs (like leaving irqs off)
      and
      
      3) print how long the init call took in milliseconds.
      
      While trying to optimize the boot speed of my laptop, I have been loving
      number 3 to figure out what to optimize...  ...  and then I wished that
      the same thing was done for module loading.
      
      This patch makes the module loader use this exact same functionality; it's
      a logical extension in my view (since modules are just sort of late
      binding initcalls anyway) and so far I've found it quite useful in finding
      where things are too slow in my boot.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      59f9415f
  23. 06 8月, 2008 1 次提交