1. 16 1月, 2009 1 次提交
  2. 11 1月, 2009 1 次提交
  3. 08 1月, 2009 1 次提交
    • A
      async: Asynchronous function calls to speed up kernel boot · 22a9d645
      Arjan van de Ven 提交于
      Right now, most of the kernel boot is strictly synchronous, such that
      various hardware delays are done sequentially.
      
      In order to make the kernel boot faster, this patch introduces
      infrastructure to allow doing some of the initialization steps
      asynchronously, which will hide significant portions of the hardware delays
      in practice.
      
      In order to not change device order and other similar observables, this
      patch does NOT do full parallel initialization.
      
      Rather, it operates more in the way an out of order CPU does; the work may
      be done out of order and asynchronous, but the observable effects
      (instruction retiring for the CPU) are still done in the original sequence.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      22a9d645
  4. 19 12月, 2008 1 次提交
    • P
      "Tree RCU": scalable classic RCU implementation · 64db4cff
      Paul E. McKenney 提交于
      This patch fixes a long-standing performance bug in classic RCU that
      results in massive internal-to-RCU lock contention on systems with
      more than a few hundred CPUs.  Although this patch creates a separate
      flavor of RCU for ease of review and patch maintenance, it is intended
      to replace classic RCU.
      
      This patch still handles stress better than does mainline, so I am still
      calling it ready for inclusion.  This patch is against the -tip tree.
      Nevertheless, experience on an actual 1000+ CPU machine would still be
      most welcome.
      
      Most of the changes noted below were found while creating an rcutiny
      (which should permit ejecting the current rcuclassic) and while doing
      detailed line-by-line documentation.
      
      Updates from v9 (http://lkml.org/lkml/2008/12/2/334):
      
      o	Fixes from remainder of line-by-line code walkthrough,
      	including comment spelling, initialization, undesirable
      	narrowing due to type conversion, removing redundant memory
      	barriers, removing redundant local-variable initialization,
      	and removing redundant local variables.
      
      	I do not believe that any of these fixes address the CPU-hotplug
      	issues that Andi Kleen was seeing, but please do give it a whirl
      	in case the machine is smarter than I am.
      
      	A writeup from the walkthrough may be found at the following
      	URL, in case you are suffering from terminal insomnia or
      	masochism:
      
      	http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf
      
      o	Made rcutree tracing use seq_file, as suggested some time
      	ago by Lai Jiangshan.
      
      o	Added a .csv variant of the rcudata debugfs trace file, to allow
      	people having thousands of CPUs to drop the data into
      	a spreadsheet.	Tested with oocalc and gnumeric.  Updated
      	documentation to suit.
      
      Updates from v8 (http://lkml.org/lkml/2008/11/15/139):
      
      o	Fix a theoretical race between grace-period initialization and
      	force_quiescent_state() that could occur if more than three
      	jiffies were required to carry out the grace-period
      	initialization.  Which it might, if you had enough CPUs.
      
      o	Apply Ingo's printk-standardization patch.
      
      o	Substitute local variables for repeated accesses to global
      	variables.
      
      o	Fix comment misspellings and redundant (but harmless) increments
      	of ->n_rcu_pending (this latter after having explicitly added it).
      
      o	Apply checkpatch fixes.
      
      Updates from v7 (http://lkml.org/lkml/2008/10/10/291):
      
      o	Fixed a number of problems noted by Gautham Shenoy, including
      	the cpu-stall-detection bug that he was having difficulty
      	convincing me was real.  ;-)
      
      o	Changed cpu-stall detection to wait for ten seconds rather than
      	three in order to reduce false positive, as suggested by Ingo
      	Molnar.
      
      o	Produced a design document (http://lwn.net/Articles/305782/).
      	The act of writing this document uncovered a number of both
      	theoretical and "here and now" bugs as noted below.
      
      o	Fix dynticks_nesting accounting confusion, simplify WARN_ON()
      	condition, fix kerneldoc comments, and add memory barriers
      	in dynticks interface functions.
      
      o	Add more data to tracing.
      
      o	Remove unused "rcu_barrier" field from rcu_data structure.
      
      o	Count calls to rcu_pending() from scheduling-clock interrupt
      	to use as a surrogate timebase should jiffies stop counting.
      
      o	Fix a theoretical race between force_quiescent_state() and
      	grace-period initialization.  Yes, initialization does have to
      	go on for some jiffies for this race to occur, but given enough
      	CPUs...
      
      Updates from v6 (http://lkml.org/lkml/2008/9/23/448):
      
      o	Fix a number of checkpatch.pl complaints.
      
      o	Apply review comments from Ingo Molnar and Lai Jiangshan
      	on the stall-detection code.
      
      o	Fix several bugs in !CONFIG_SMP builds.
      
      o	Fix a misspelled config-parameter name so that RCU now announces
      	at boot time if stall detection is configured.
      
      o	Run tests on numerous combinations of configurations parameters,
      	which after the fixes above, now build and run correctly.
      
      Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):
      
      o	Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
      	changeset some time ago, and finally got around to retesting
      	this option).
      
      o	Fix some tracing bugs in rcupreempt that caused incorrect
      	totals to be printed.
      
      o	I now test with a more brutal random-selection online/offline
      	script (attached).  Probably more brutal than it needs to be
      	on the people reading it as well, but so it goes.
      
      o	A number of optimizations and usability improvements:
      
      	o	Make rcu_pending() ignore the grace-period timeout when
      		there is no grace period in progress.
      
      	o	Make force_quiescent_state() avoid going for a global
      		lock in the case where there is no grace period in
      		progress.
      
      	o	Rearrange struct fields to improve struct layout.
      
      	o	Make call_rcu() initiate a grace period if RCU was
      		idle, rather than waiting for the next scheduling
      		clock interrupt.
      
      	o	Invoke rcu_irq_enter() and rcu_irq_exit() only when
      		idle, as suggested by Andi Kleen.  I still don't
      		completely trust this change, and might back it out.
      
      	o	Make CONFIG_RCU_TRACE be the single config variable
      		manipulated for all forms of RCU, instead of the prior
      		confusion.
      
      	o	Document tracing files and formats for both rcupreempt
      		and rcutree.
      
      Updates from v4 for those missing v5 given its bad subject line:
      
      o	Separated dynticks interface so that NMIs and irqs call separate
      	functions, greatly simplifying it.  In particular, this code
      	no longer requires a proof of correctness.  ;-)
      
      o	Separated dynticks state out into its own per-CPU structure,
      	avoiding the duplicated accounting.
      
      o	The case where a dynticks-idle CPU runs an irq handler that
      	invokes call_rcu() is now correctly handled, forcing that CPU
      	out of dynticks-idle mode.
      
      o	Review comments have been applied (thank you all!!!).
      	For but one example, fixed the dynticks-ordering issue that
      	Manfred pointed out, saving me much debugging.  ;-)
      
      o	Adjusted rcuclassic and rcupreempt to handle dynticks changes.
      
      Attached is an updated patch to Classic RCU that applies a hierarchy,
      greatly reducing the contention on the top-level lock for large machines.
      This passes 10-hour concurrent rcutorture and online-offline testing on
      128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
      bugs in presence of dynticks (exciting working on a system where
      "sleep 1" hangs until interrupted...), which were fixed in the
      2.6.27 kernel.  It is getting more reliable than mainline by some
      measures, so the next version will be against -tip for inclusion.
      See also Manfred Spraul's recent patches (or his earlier work from
      2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
      We will converge onto a common patch in the fullness of time, but are
      currently exploring different regions of the design space.  That said,
      I have already gratefully stolen quite a few of Manfred's ideas.
      
      This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
      of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
      64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
      there is no hierarchy.  By default, the RCU initialization code will
      adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
      architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
      this balancing, allowing the hierarchy to be exactly aligned to the
      underlying hardware.  Up to two levels of hierarchy are permitted
      (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
      systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
      am going to regret saying this, but this seems more than sufficient
      for the foreseeable future.  (Some architectures might wish to set
      CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
      If this becomes a real problem, additional levels can be added, but I
      doubt that it will make a significant difference on real hardware.)
      
      In the common case, a given CPU will manipulate its private rcu_data
      structure and the rcu_node structure that it shares with its immediate
      neighbors.  This can reduce both lock and memory contention by multiple
      orders of magnitude, which should eliminate the need for the strange
      manipulations that are reported to be required when running Linux on
      very large systems.
      
      Some shortcomings:
      
      o	More bugs will probably surface as a result of an ongoing
      	line-by-line code inspection.
      
      	Patches will be provided as required.
      
      o	There are probably hangs, rcutorture failures, &c.  Seems
      	quite stable on a 128-CPU machine, but that is kind of small
      	compared to 4096 CPUs.  However, seems to do better than
      	mainline.
      
      	Patches will be provided as required.
      
      o	The memory footprint of this version is several KB larger
      	than rcuclassic.
      
      	A separate UP-only rcutiny patch will be provided, which will
      	reduce the memory footprint significantly, even compared
      	to the old rcuclassic.  One such patch passes light testing,
      	and has a memory footprint smaller even than rcuclassic.
      	Initial reaction from various embedded guys was "it is not
      	worth it", so am putting it aside.
      
      Credits:
      
      o	Manfred Spraul for ideas, review comments, and bugs spotted,
      	as well as some good friendly competition.  ;-)
      
      o	Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
      	Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
      	for reviews and comments.
      
      o	Thomas Gleixner for much-needed help with some timer issues
      	(see patches below).
      
      o	Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
      	Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
      	Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
      	alive despite my heavy abuse^Wtesting.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      64db4cff
  5. 08 12月, 2008 1 次提交
    • F
      tracing/function-graph-tracer: introduce __notrace_funcgraph to filter special functions · 8b96f011
      Frederic Weisbecker 提交于
      Impact: trace more functions
      
      When the function graph tracer is configured, three more files are not
      traced to prevent only four functions to be traced. And this impacts the
      normal function tracer too.
      
      arch/x86/kernel/process_64/32.c:
      
      I had crashes when I let this file traced. After some debugging, I saw
      that the "current" task point was changed inside__swtich_to(), ie:
      "write_pda(pcurrent, next_p);" inside process_64.c Since the tracer store
      the original return address of the function inside current, we had
      crashes. Only __switch_to() has to be excluded from tracing.
      
      kernel/module.c and kernel/extable.c:
      
      Because of a function used internally by the function graph tracer:
      __kernel_text_address()
      
      To let the other functions inside these files to be traced, this patch
      introduces the __notrace_funcgraph function prefix which is __notrace if
      function graph tracer is configured and nothing if not.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8b96f011
  6. 26 11月, 2008 1 次提交
  7. 18 11月, 2008 1 次提交
  8. 14 11月, 2008 1 次提交
  9. 11 11月, 2008 2 次提交
    • F
      tracing, x86: add low level support for ftrace return tracing · caf4b323
      Frederic Weisbecker 提交于
      Impact: add infrastructure for function-return tracing
      
      Add low level support for ftrace return tracing.
      
      This plug-in stores return addresses on the thread_info structure of
      the current task.
      
      The index of the current return address is initialized when the task
      is the first one (init) and when a process forks (the child). It is
      not needed when a task does a sys_execve because after this syscall,
      it still needs to return on the kernel functions it called.
      
      Note that the code of return_to_handler has been suggested by Steven
      Rostedt as almost all of the ideas of improvements in this V3.
      
      For purpose of security, arch/x86/kernel/process_32.c is not traced
      because __switch_to() changes the current task during its execution.
      That could cause inconsistency in the stored return address of this
      function even if I didn't have any crash after testing with tracing on
      this function enabled.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      caf4b323
    • I
      sched: rename SCHED_NO_NO_OMIT_FRAME_POINTER => SCHED_OMIT_FRAME_POINTER · ae1e9130
      Ingo Molnar 提交于
      Impact: cleanup, change .config option name
      
      We had this ugly config name for a long time for hysteric raisons.
      Rename it to a saner name.
      
      We still cannot get rid of it completely, until /proc/<pid>/stack
      usage replaces WCHAN usage for good.
      
      We'll be able to do that in the v2.6.29/v2.6.30 timeframe.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ae1e9130
  10. 03 11月, 2008 1 次提交
  11. 21 10月, 2008 1 次提交
  12. 20 10月, 2008 2 次提交
    • M
      container freezer: implement freezer cgroup subsystem · dc52ddc0
      Matt Helsley 提交于
      This patch implements a new freezer subsystem in the control groups
      framework.  It provides a way to stop and resume execution of all tasks in
      a cgroup by writing in the cgroup filesystem.
      
      The freezer subsystem in the container filesystem defines a file named
      freezer.state.  Writing "FROZEN" to the state file will freeze all tasks
      in the cgroup.  Subsequently writing "RUNNING" will unfreeze the tasks in
      the cgroup.  Reading will return the current state.
      
      * Examples of usage :
      
         # mkdir /containers/freezer
         # mount -t cgroup -ofreezer freezer  /containers
         # mkdir /containers/0
         # echo $some_pid > /containers/0/tasks
      
      to get status of the freezer subsystem :
      
         # cat /containers/0/freezer.state
         RUNNING
      
      to freeze all tasks in the container :
      
         # echo FROZEN > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         FREEZING
         # cat /containers/0/freezer.state
         FROZEN
      
      to unfreeze all tasks in the container :
      
         # echo RUNNING > /containers/0/freezer.state
         # cat /containers/0/freezer.state
         RUNNING
      
      This is the basic mechanism which should do the right thing for user space
      task in a simple scenario.
      
      It's important to note that freezing can be incomplete.  In that case we
      return EBUSY.  This means that some tasks in the cgroup are busy doing
      something that prevents us from completely freezing the cgroup at this
      time.  After EBUSY, the cgroup will remain partially frozen -- reflected
      by freezer.state reporting "FREEZING" when read.  The state will remain
      "FREEZING" until one of these things happens:
      
      	1) Userspace cancels the freezing operation by writing "RUNNING" to
      		the freezer.state file
      	2) Userspace retries the freezing operation by writing "FROZEN" to
      		the freezer.state file (writing "FREEZING" is not legal
      		and returns EIO)
      	3) The tasks that blocked the cgroup from entering the "FROZEN"
      		state disappear from the cgroup's set of tasks.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: export thaw_process]
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NMatt Helsley <matthltc@us.ibm.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Tested-by: NMatt Helsley <matthltc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc52ddc0
    • M
      container freezer: make refrigerator always available · 8174f150
      Matt Helsley 提交于
      Now that the TIF_FREEZE flag is available in all architectures, extract
      the refrigerator() and freeze_task() from kernel/power/process.c and make
      it available to all.
      
      The refrigerator() can now be used in a control group subsystem
      implementing a control group freezer.
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NMatt Helsley <matthltc@us.ibm.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Tested-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8174f150
  13. 14 10月, 2008 1 次提交
    • M
      tracing: Kernel Tracepoints · 97e1c18e
      Mathieu Desnoyers 提交于
      Implementation of kernel tracepoints. Inspired from the Linux Kernel
      Markers. Allows complete typing verification by declaring both tracing
      statement inline functions and probe registration/unregistration static
      inline functions within the same macro "DEFINE_TRACE". No format string
      is required. See the tracepoint Documentation and Samples patches for
      usage examples.
      
      Taken from the documentation patch :
      
      "A tracepoint placed in code provides a hook to call a function (probe)
      that you can provide at runtime. A tracepoint can be "on" (a probe is
      connected to it) or "off" (no probe is attached). When a tracepoint is
      "off" it has no effect, except for adding a tiny time penalty (checking
      a condition for a branch) and space penalty (adding a few bytes for the
      function call at the end of the instrumented function and adds a data
      structure in a separate section).  When a tracepoint is "on", the
      function you provide is called each time the tracepoint is executed, in
      the execution context of the caller. When the function provided ends its
      execution, it returns to the caller (continuing from the tracepoint
      site).
      
      You can put tracepoints at important locations in the code. They are
      lightweight hooks that can pass an arbitrary number of parameters, which
      prototypes are described in a tracepoint declaration placed in a header
      file."
      
      Addition and removal of tracepoints is synchronized by RCU using the
      scheduler (and preempt_disable) as guarantees to find a quiescent state
      (this is really RCU "classic"). The update side uses rcu_barrier_sched()
      with call_rcu_sched() and the read/execute side uses
      "preempt_disable()/preempt_enable()".
      
      We make sure the previous array containing probes, which has been
      scheduled for deletion by the rcu callback, is indeed freed before we
      proceed to the next update. It therefore limits the rate of modification
      of a single tracepoint to one update per RCU period. The objective here
      is to permit fast batch add/removal of probes on _different_
      tracepoints.
      
      Changelog :
      - Use #name ":" #proto as string to identify the tracepoint in the
        tracepoint table. This will make sure not type mismatch happens due to
        connexion of a probe with the wrong type to a tracepoint declared with
        the same name in a different header.
      - Add tracepoint_entry_free_old.
      - Change __TO_TRACE to get rid of the 'i' iterator.
      
      Masami Hiramatsu <mhiramat@redhat.com> :
      Tested on x86-64.
      
      Performance impact of a tracepoint : same as markers, except that it
      adds about 70 bytes of instructions in an unlikely branch of each
      instrumented function (the for loop, the stack setup and the function
      call). It currently adds a memory read, a test and a conditional branch
      at the instrumentation site (in the hot path). Immediate values will
      eventually change this into a load immediate, test and branch, which
      removes the memory read which will make the i-cache impact smaller
      (changing the memory read for a load immediate removes 3-4 bytes per
      site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
      also saves the d-cache hit).
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added.
      
      Quoting Hideo Aoki about Markers :
      
      I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
      tree, which includes several markers for LTTng, using an ia64 server.
      
      While the immediate trace mark feature isn't implemented on ia64, there
      is no major performance regression. So, I think that we don't have any
      issues to propose merging marker point patches into Linus's tree from
      the viewpoint of performance impact.
      
      I prepared two kernels to evaluate. The first one was compiled without
      CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.
      
      I downloaded the original hackbench from the following URL:
      http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
      
      I ran hackbench 5 times in each condition and calculated the average and
      difference between the kernels.
      
          The parameter of hackbench: every 50 from 50 to 800
          The number of CPUs of the server: 2, 4, and 8
      
      Below is the results. As you can see, major performance regression
      wasn't found in any case. Even if number of processes increases,
      differences between marker-enabled kernel and marker- disabled kernel
      doesn't increase. Moreover, if number of CPUs increases, the differences
      doesn't increase either.
      
      Curiously, marker-enabled kernel is better than marker-disabled kernel
      in more than half cases, although I guess it comes from the difference
      of memory access pattern.
      
      * 2 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      4.811   |       4.872  |  +0.061  |  +1.27  |
            100 |      9.854   |      10.309  |  +0.454  |  +4.61  |
            150 |     15.602   |      15.040  |  -0.562  |  -3.6   |
            200 |     20.489   |      20.380  |  -0.109  |  -0.53  |
            250 |     25.798   |      25.652  |  -0.146  |  -0.56  |
            300 |     31.260   |      30.797  |  -0.463  |  -1.48  |
            350 |     36.121   |      35.770  |  -0.351  |  -0.97  |
            400 |     42.288   |      42.102  |  -0.186  |  -0.44  |
            450 |     47.778   |      47.253  |  -0.526  |  -1.1   |
            500 |     51.953   |      52.278  |  +0.325  |  +0.63  |
            550 |     58.401   |      57.700  |  -0.701  |  -1.2   |
            600 |     63.334   |      63.222  |  -0.112  |  -0.18  |
            650 |     68.816   |      68.511  |  -0.306  |  -0.44  |
            700 |     74.667   |      74.088  |  -0.579  |  -0.78  |
            750 |     78.612   |      79.582  |  +0.970  |  +1.23  |
            800 |     85.431   |      85.263  |  -0.168  |  -0.2   |
      --------------------------------------------------------------
      
      * 4 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.586   |       2.584  |  -0.003  |  -0.1   |
            100 |      5.254   |       5.283  |  +0.030  |  +0.56  |
            150 |      8.012   |       8.074  |  +0.061  |  +0.76  |
            200 |     11.172   |      11.000  |  -0.172  |  -1.54  |
            250 |     13.917   |      14.036  |  +0.119  |  +0.86  |
            300 |     16.905   |      16.543  |  -0.362  |  -2.14  |
            350 |     19.901   |      20.036  |  +0.135  |  +0.68  |
            400 |     22.908   |      23.094  |  +0.186  |  +0.81  |
            450 |     26.273   |      26.101  |  -0.172  |  -0.66  |
            500 |     29.554   |      29.092  |  -0.461  |  -1.56  |
            550 |     32.377   |      32.274  |  -0.103  |  -0.32  |
            600 |     35.855   |      35.322  |  -0.533  |  -1.49  |
            650 |     39.192   |      38.388  |  -0.804  |  -2.05  |
            700 |     41.744   |      41.719  |  -0.025  |  -0.06  |
            750 |     45.016   |      44.496  |  -0.520  |  -1.16  |
            800 |     48.212   |      47.603  |  -0.609  |  -1.26  |
      --------------------------------------------------------------
      
      * 8 CPUs
      
      Number of | without      | with         | diff     | diff    |
      processes | Marker [Sec] | Marker [Sec] |   [Sec]  |   [%]   |
      --------------------------------------------------------------
             50 |      2.094   |       2.072  |  -0.022  |  -1.07  |
            100 |      4.162   |       4.273  |  +0.111  |  +2.66  |
            150 |      6.485   |       6.540  |  +0.055  |  +0.84  |
            200 |      8.556   |       8.478  |  -0.078  |  -0.91  |
            250 |     10.458   |      10.258  |  -0.200  |  -1.91  |
            300 |     12.425   |      12.750  |  +0.325  |  +2.62  |
            350 |     14.807   |      14.839  |  +0.032  |  +0.22  |
            400 |     16.801   |      16.959  |  +0.158  |  +0.94  |
            450 |     19.478   |      19.009  |  -0.470  |  -2.41  |
            500 |     21.296   |      21.504  |  +0.208  |  +0.98  |
            550 |     23.842   |      23.979  |  +0.137  |  +0.57  |
            600 |     26.309   |      26.111  |  -0.198  |  -0.75  |
            650 |     28.705   |      28.446  |  -0.259  |  -0.9   |
            700 |     31.233   |      31.394  |  +0.161  |  +0.52  |
            750 |     34.064   |      33.720  |  -0.344  |  -1.01  |
            800 |     36.320   |      36.114  |  -0.206  |  -0.57  |
      --------------------------------------------------------------
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: NMasami Hiramatsu <mhiramat@redhat.com>
      Acked-by: N'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      97e1c18e
  14. 26 7月, 2008 1 次提交
    • A
      build kernel/profile.o only when requested · b03f6489
      Adrian Bunk 提交于
      Build kernel/profile.o only if CONFIG_PROFILING is enabled.
      
      This makes CONFIG_PROFILING=n kernels smaller.
      
      As a bonus, some profile_tick() calls and one branch from schedule() are
      now eliminated with CONFIG_PROFILING=n (but I doubt these are
      measurable effects).
      
      This patch changes the effects of CONFIG_PROFILING=n, but I don't think
      having more than two choices would be the better choice.
      
      This patch also adds the name of the first parameter to the prototypes
      of profile_{hits,tick}() since I anyway had to add them for the dummy
      functions.
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b03f6489
  15. 18 7月, 2008 1 次提交
  16. 17 7月, 2008 1 次提交
  17. 11 7月, 2008 1 次提交
  18. 30 6月, 2008 1 次提交
  19. 26 6月, 2008 1 次提交
    • J
      Add generic helpers for arch IPI function calls · 3d442233
      Jens Axboe 提交于
      This adds kernel/smp.c which contains helpers for IPI function calls. In
      addition to supporting the existing smp_call_function() in a more efficient
      manner, it also adds a more scalable variant called smp_call_function_single()
      for calling a given function on a single CPU only.
      
      The core of this is based on the x86-64 patch from Nick Piggin, lots of
      changes since then. "Alan D. Brunelle" <Alan.Brunelle@hp.com> has
      contributed lots of fixes and suggestions as well. Also thanks to
      Paul E. McKenney <paulmck@linux.vnet.ibm.com> for reviewing RCU usage
      and getting rid of the data allocation fallback deadlock.
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      3d442233
  20. 06 6月, 2008 2 次提交
  21. 24 5月, 2008 4 次提交
  22. 06 5月, 2008 1 次提交
    • P
      sched: add optional support for CONFIG_HAVE_UNSTABLE_SCHED_CLOCK · 3e51f33f
      Peter Zijlstra 提交于
      this replaces the rq->clock stuff (and possibly cpu_clock()).
      
       - architectures that have an 'imperfect' hardware clock can set
         CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
      
       - the 'jiffie' window might be superfulous when we update tick_gtod
         before the __update_sched_clock() call in sched_clock_tick()
      
       - cpu_clock() might be implemented as:
      
           sched_clock_cpu(smp_processor_id())
      
         if the accuracy proves good enough - how far can TSC drift in a
         single jiffie when considering the filtering and idle hooks?
      
      [ mingo@elte.hu: various fixes and cleanups ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3e51f33f
  23. 29 4月, 2008 1 次提交
  24. 18 4月, 2008 1 次提交
  25. 17 4月, 2008 1 次提交
  26. 09 2月, 2008 4 次提交
    • H
      avoid overflows in kernel/time.c · bdc80787
      H. Peter Anvin 提交于
      When the conversion factor between jiffies and milli- or microseconds is
      not a single multiply or divide, as for the case of HZ == 300, we currently
      do a multiply followed by a divide.  The intervening result, however, is
      subject to overflows, especially since the fraction is not simplified (for
      HZ == 300, we multiply by 300 and divide by 1000).
      
      This is exposed to the user when passing a large timeout to poll(), for
      example.
      
      This patch replaces the multiply-divide with a reciprocal multiplication on
      32-bit platforms.  When the input is an unsigned long, there is no portable
      way to do this on 64-bit platforms there is no portable way to do this
      since it requires a 128-bit intermediate result (which gcc does support on
      64-bit platforms but may generate libgcc calls, e.g.  on 64-bit s390), but
      since the output is a 32-bit integer in the cases affected, just simplify
      the multiply-divide (*3/10 instead of *300/1000).
      
      The reciprocal multiply used can have off-by-one errors in the upper half
      of the valid output range.  This could be avoided at the expense of having
      to deal with a potential 65-bit intermediate result.  Since the intent is
      to avoid overflow problems and most of the other time conversions are only
      semiexact, the off-by-one errors were considered an acceptable tradeoff.
      
      At Ralf Baechle's suggestion, this version uses a Perl script to compute
      the necessary constants.  We already have dependencies on Perl for kernel
      compiles.  This does, however, require the Perl module Math::BigInt, which
      is included in the standard Perl distribution starting with version 5.8.0.
      In order to support older versions of Perl, include a table of canned
      constants in the script itself, and structure the script so that
      Math::BigInt isn't required if pulling values from said table.
      
      Running the script requires that the HZ value is available from the
      Makefile.  Thus, this patch also adds the Kconfig variable CONFIG_HZ to the
      architectures which didn't already have it (alpha, cris, frv, h8300, m32r,
      m68k, m68knommu, sparc, v850, and xtensa.) It does *not* touch the sh or
      sh64 architectures, since Paul Mundt has dealt with those separately in the
      sh tree.
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>,
      Cc: Sam Ravnborg <sam@ravnborg.org>,
      Cc: Paul Mundt <lethal@linux-sh.org>,
      Cc: Richard Henderson <rth@twiddle.net>,
      Cc: Michael Starvik <starvik@axis.com>,
      Cc: David Howells <dhowells@redhat.com>,
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>,
      Cc: Hirokazu Takata <takata@linux-m32r.org>,
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>,
      Cc: Roman Zippel <zippel@linux-m68k.org>,
      Cc: William L. Irwin <sparclinux@vger.kernel.org>,
      Cc: Chris Zankel <chris@zankel.net>,
      Cc: H. Peter Anvin <hpa@zytor.com>,
      Cc: Jan Engelhardt <jengelh@computergmbh.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bdc80787
    • P
      namespaces: cleanup the code managed with PID_NS option · 74bd59bb
      Pavel Emelyanov 提交于
      Just like with the user namespaces, move the namespace management code into
      the separate .c file and mark the (already existing) PID_NS option as "depend
      on NAMESPACES"
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      74bd59bb
    • P
      namespaces: cleanup the code managed with the USER_NS option · aee16ce7
      Pavel Emelyanov 提交于
      Make the user_namespace.o compilation depend on this option and move the
      init_user_ns into user.c file to make the kernel compile and work without the
      namespaces support.  This make the user namespace code be organized similar to
      other namespaces'.
      
      Also mask the USER_NS option as "depend on NAMESPACES".
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aee16ce7
    • P
      namespaces: move the UTS namespace under UTS_NS option · 58bfdd6d
      Pavel Emelyanov 提交于
      Currently all the namespace management code is in the kernel/utsname.c file,
      so just compile it out and make stubs in the appropriate header.
      
      The init namespace itself is in init/version.c and is in the kernel all the
      time.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58bfdd6d
  27. 08 2月, 2008 1 次提交
  28. 06 2月, 2008 2 次提交
    • M
      latency.c: use QoS infrastructure · f011e2e2
      Mark Gross 提交于
      Replace latency.c use with pm_qos_params use.
      Signed-off-by: Nmark gross <mgross@linux.intel.com>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Jaroslav Kysela <perex@suse.cz>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f011e2e2
    • M
      pm qos infrastructure and interface · d82b3518
      Mark Gross 提交于
      The following patch is a generalization of the latency.c implementation done
      by Arjan last year.  It provides infrastructure for more than one parameter,
      and exposes a user mode interface for processes to register pm_qos
      expectations of processes.
      
      This interface provides a kernel and user mode interface for registering
      performance expectations by drivers, subsystems and user space applications on
      one of the parameters.
      
      Currently we have {cpu_dma_latency, network_latency, network_throughput} as
      the initial set of pm_qos parameters.
      
      The infrastructure exposes multiple misc device nodes one per implemented
      parameter.  The set of parameters implement is defined by pm_qos_power_init()
      and pm_qos_params.h.  This is done because having the available parameters
      being runtime configurable or changeable from a driver was seen as too easy to
      abuse.
      
      For each parameter a list of performance requirements is maintained along with
      an aggregated target value.  The aggregated target value is updated with
      changes to the requirement list or elements of the list.  Typically the
      aggregated target value is simply the max or min of the requirement values
      held in the parameter list elements.
      
      >From kernel mode the use of this interface is simple:
      
      pm_qos_add_requirement(param_id, name, target_value):
      
        Will insert a named element in the list for that identified PM_QOS
        parameter with the target value.  Upon change to this list the new target is
        recomputed and any registered notifiers are called only if the target value
        is now different.
      
      pm_qos_update_requirement(param_id, name, new_target_value):
      
        Will search the list identified by the param_id for the named list element
        and then update its target value, calling the notification tree if the
        aggregated target is changed.  with that name is already registered.
      
      pm_qos_remove_requirement(param_id, name):
      
        Will search the identified list for the named element and remove it, after
        removal it will update the aggregate target and call the notification tree
        if the target was changed as a result of removing the named requirement.
      
      >From user mode:
      
        Only processes can register a pm_qos requirement.  To provide for
        automatic cleanup for process the interface requires the process to register
        its parameter requirements in the following way:
      
        To register the default pm_qos target for the specific parameter, the
        process must open one of /dev/[cpu_dma_latency, network_latency,
        network_throughput]
      
        As long as the device node is held open that process has a registered
        requirement on the parameter.  The name of the requirement is
        "process_<PID>" derived from the current->pid from within the open system
        call.
      
        To change the requested target value the process needs to write a s32
        value to the open device node.  This translates to a
        pm_qos_update_requirement call.
      
        To remove the user mode request for a target value simply close the device
        node.
      
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: fix build again]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Nmark gross <mgross@linux.intel.com>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Jaroslav Kysela <perex@suse.cz>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Adam Belay <abelay@novell.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d82b3518
  29. 03 2月, 2008 1 次提交
    • P
      kobject: Always build in kernel/ksysfs.o. · dfacd68e
      Paul Mundt 提交于
      kernel/ksysfs.c seems to be a random dumping group for misc globals
      that the rest of the tree depend on. This has caused problems with
      exports in the past when sysfs is disabled, which can already be
      observed in commit-id 51107301.
      
      The latest one is the kernel_kobj usage, which presently results in:
      
      fs/built-in.o: In function `debugfs_init':
      inode.c:(.init.text+0xc34): undefined reference to `kernel_kobj'
      make: *** [.tmp_vmlinux1] Error 1
      
      kernel/ksysfs.c itself at this point only contains globals and some
      basic sysfs initialization, the sysfs initialization code is optimized
      out when we build with sysfs disabled. Given that, it's easier to just
      build in unconditionally, rather than trying to find some other random
      place to dump and initialize the globals.
      
      Additionally, the current trend seems to be decoupling of kobjects from
      sysfs, in which case it still makes sense to perform the kernel_kobj
      initialization that happens here even if sysfs is disabled, as
      lib/kobject.o is built-in unconditionally.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
      dfacd68e
  30. 30 1月, 2008 1 次提交