1. 17 Oct, 2007 (15 commits)
    • mm: dirty balancing for tasks · 3e26c149
      Authored by Peter Zijlstra
      Based on ideas of Andrew:
        http://marc.info/?l=linux-kernel&m=102912915020543&w=2
      
      Scale the bdi dirty limit inversely with the task's dirty rate.
      This makes heavy writers have a lower dirty limit than the occasional writer.
      
      Andrea proposed something similar:
        http://lwn.net/Articles/152277/
      
      The main disadvantage of his patch is that he uses an unrelated quantity to
      measure time, which leaves him with a workload-dependent tunable. Other than
      that, the two approaches appear quite similar.
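      
      A minimal sketch of the idea (the helper name, fixed-point scale, and
      1/8 head-room factor are illustrative assumptions, not the kernel's
      exact code): the larger a task's recent share of dirtying, the more
      gets shaved off its effective dirty limit.
      
          /*
           * Illustrative sketch only. 'task_frac' is the task's recent
           * share of dirtied pages, in fixed point over [0, TASK_SCALE].
           */
          #define TASK_SCALE 1024
          
          static unsigned long task_dirty_limit(unsigned long bdi_limit,
                                                unsigned long task_frac)
          {
                  /* heavy dirtiers (large task_frac) lose up to 1/8 */
                  return bdi_limit - (bdi_limit / 8) * task_frac / TASK_SCALE;
          }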
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: per device dirty threshold · 04fbfdc1
      Authored by Peter Zijlstra
      Scale writeback cache per backing device, proportional to its writeout speed.
      
      By decoupling the BDI dirty thresholds, a number of problems we currently
      have will go away, namely:
      
       - mutual interference starvation (for any number of BDIs);
       - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).
      
      It might be that all dirty pages are for a single BDI while other BDIs are
      idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
      dirty pages outstanding and make progress.
      
      A global threshold also creates a deadlock for stacked BDIs; when A writes to
      B, and A generates enough dirty pages to get throttled, B will never start
      writeback until the dirty pages go away. Again, by giving each BDI its own
      'independent' dirty limit, this problem is avoided.
      
      So the problem is to determine how to distribute the total dirty limit
      across the BDIs fairly and efficiently. A BDI that has a large dirty limit
      but does not have any dirty pages outstanding is a waste.
      
      What is done is to keep a floating proportion between the BDIs based on
      writeback completions. This way, faster/more active devices get a larger
      share than slower/idle devices.
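      
      A hedged sketch of the distribution (names simplified; the kernel tracks
      a decaying "floating proportion" rather than the raw counters shown
      here): each BDI's share of the global limit follows its fraction of
      recent writeback completions.
      
          /* sketch: proportional split of the global dirty limit */
          static unsigned long bdi_dirty_limit(unsigned long global_limit,
                                               unsigned long bdi_completions,
                                               unsigned long total_completions)
          {
                  if (!total_completions)
                          return global_limit;    /* no history yet */
          
                  /* faster/more active devices get a larger share */
                  return (unsigned long)((unsigned long long)global_limit *
                                         bdi_completions / total_completions);
          }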
      
      [akpm@linux-foundation.org: fix warnings]
      [hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • time: introduce xtime_seconds · f20bf612
      Authored by Ingo Molnar
      Improve the performance of sys_time(). sys_time() returns time in seconds,
      but it does so by calling do_gettimeofday() and then returning the
      tv_sec portion of the GTOD time. The data structure "xtime", which
      is updated on every timer/scheduler tick, already offers HZ-granularity
      time.
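      
      A minimal sketch of the resulting fast path (assuming the era's
      get_seconds() helper, which reads the xtime-derived seconds count
      without a full GTOD query; error handling is abbreviated):
      
          /* sketch: sys_time() without the do_gettimeofday() round trip */
          asmlinkage long sys_time(time_t __user *tloc)
          {
                  time_t i = get_seconds();   /* HZ granularity, xtime-based */
          
                  if (tloc && put_user(i, tloc))
                          return -EFAULT;
                  return i;
          }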
      
      The patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD
      dual-core system:
      
      v2.6.23:
      
      #threads
      
         1:     transactions:                        4073   (407.23 per sec.)
         2:     transactions:                        8530   (852.81 per sec.)
         3:     transactions:                        8321   (831.88 per sec.)
         4:     transactions:                        8407   (840.58 per sec.)
         5:     transactions:                        8070   (806.74 per sec.)
      
      v2.6.23 + sys_time-speedup.patch:
      
         1:     transactions:                        4281   (428.09 per sec.)
         2:     transactions:                        8910   (890.85 per sec.)
         3:     transactions:                        8659   (865.79 per sec.)
         4:     transactions:                        8676   (867.34 per sec.)
         5:     transactions:                        8532   (852.91 per sec.)
      
      and by 4-5% on an Intel dual-core system too:
      
      2.6.23:
      
        1:     transactions:                        4560   (455.94 per sec.)
        2:     transactions:                        10094  (1009.30 per sec.)
        3:     transactions:                        9755   (975.36 per sec.)
        4:     transactions:                        9859   (985.78 per sec.)
        5:     transactions:                        9701   (969.72 per sec.)
      
      2.6.23 + sys_time-speedup.patch:
      
        1:     transactions:                        4779   (477.84 per sec.)
        2:     transactions:                        10103  (1010.14 per sec.)
        3:     transactions:                        10141  (1013.93 per sec.)
        4:     transactions:                        10371  (1036.89 per sec.)
        5:     transactions:                        10178  (1017.50 per sec.)
      
      (The more CPUs the system has, the more speedup this patch gives for
      this particular workload.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kprobes: support kretprobe blacklist · f438d914
      Authored by Masami Hiramatsu
      Introduce architecture-dependent kretprobe blacklists to prohibit users
      from inserting return probes on functions where kprobes can be inserted
      but kretprobes cannot.
      
      This patch also removes the "__kprobes" mark from "__switch_to" on x86-64
      and registers "__switch_to" in the blacklist, because that mark was there
      only to prohibit users from inserting kretprobes.
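      
      The shape of such a per-arch table, as a hedged sketch (struct and entry
      layout inferred from the interface this patch describes):
      
          /* sketch: arch-provided list of functions kretprobes must avoid */
          struct kretprobe_blackpoint kretprobe_blacklist[] = {
                  {"__switch_to", },      /* a return probe here is unsafe */
                  {NULL, NULL}            /* terminator */
          };
      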
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset: remove sched domain hooks from cpusets · 607717a6
      Authored by Paul Jackson
      Remove the cpuset hooks that defined sched domains depending on the setting
      of the 'cpu_exclusive' flag.
      
      The cpu_exclusive flag can only be set on a child if it is set on the
      parent.
      
      This made the flag painfully unsuitable for defining a partitioning of a
      system.
      
      It was entirely unobvious to a cpuset user what partitioning of sched
      domains they would be causing when they set that one cpu_exclusive bit on
      one cpuset, because it depended on what CPUs were in the remainder of that
      cpuset's siblings and child cpusets, after subtracting out other
      cpu_exclusive cpusets.
      
      Furthermore, there was no way on production systems to query the
      result.
      
      Using the cpu_exclusive flag for this was simply wrong from the get-go.
      
      Fortunately, it was sufficiently borked that, so far as I know, almost no
      successful use has been made of this.  One real-time group did use it to
      effectively isolate CPUs from any load balancing efforts.  They are willing
      to adapt to alternative mechanisms for this, such as some way to manipulate
      the list of isolated CPUs on a running system.  They can do without the
      present cpu_exclusive based mechanism while we develop an alternative.
      
      There is a real risk, to the best of my understanding, of users
      accidentally setting up partitioned scheduler domains, inhibiting desired
      load balancing across all their CPUs, due to the non-obvious (from the
      cpuset perspective) side effects of the cpu_exclusive flag.
      
      Furthermore, since there was no way on a running system to see what one
      was doing with sched domains, this change will be invisible to any code
      using it.  Unless they have real insight into the scheduler's load
      balancing choices, users will be unable to detect that this change has
      been made in the kernel's behaviour.
      
      Initial discussion of this patch on lkml has generated much comment.  My
      (probably controversial) take on that discussion is that it has reached a
      rough consensus that the current cpuset cpu_exclusive mechanism for
      defining sched domains is borked.  There is no consensus on the
      replacement.  But since we can remove this mechanism, and since its
      continued presence risks causing unwanted partitioning of the scheduler's
      load balancing, we should remove it while we can, as we proceed to work on
      the replacement scheduler domain mechanisms.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • m32r: convert to generic sys_ptrace · 0ac15559
      Authored by Christoph Hellwig
      Convert m32r to the generic sys_ptrace.  The conversion requires an
      architecture hook after ptrace_attach, which this patch adds.  The hook
      will also be needed for a conversion of ia64 to the generic ptrace code.
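      
      A hedged sketch of the hook's shape (the no-op default for architectures
      that do not need it; the patch's exact definition may differ):
      
          /*
           * sketch: arch_ptrace_attach() runs after a successful
           * ptrace_attach(); most architectures leave it a no-op.
           */
          #ifndef arch_ptrace_attach
          #define arch_ptrace_attach(child)       do { } while (0)
          #endif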
      
      Thanks to Hirokazu Takata for fixing a bug in the first version of this
      code.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: Add hugetlb_dynamic_pool sysctl · 54f9f80d
      Authored by Adam Litke
      The maximum size of the huge page pool can be controlled using the overall
      size of the hugetlb filesystem (via its 'size' mount option).  However, in
      the common case this will not be set, as the pool is traditionally fixed in
      size at boot time.  In order to maintain the expected semantics, we need to
      prevent the pool from expanding by default.
      
      This patch introduces a new sysctl controlling dynamic pool resizing.  When
      this is enabled, the pool will expand beyond its base size, up to the size
      of the hugetlb filesystem.  It is disabled by default.
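      
      A sketch of the flag and its /proc/sys/vm/ entry (field values are
      plausible assumptions for the era's ctl_table, not a quote of the patch):
      
          /* sketch: 0 = fixed-size pool (default), 1 = may grow */
          int hugetlb_dynamic_pool;
          
          static struct ctl_table hugetlb_table[] = {
                  {
                          .procname       = "hugetlb_dynamic_pool",
                          .data           = &hugetlb_dynamic_pool,
                          .maxlen         = sizeof(int),
                          .mode           = 0644,
                          .proc_handler   = &proc_dointvec,
                  },
                  { }
          };
      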
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Dave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory unplug: memory hotplug cleanup · 75884fb1
      Authored by KAMEZAWA Hiroyuki
      A cleanup patch for the "scanning memory resource [start, end)" operation.
      
      Currently, the find_next_system_ram() function is used in memory hotplug,
      but this interface is not easy to use and the code is complicated.
      
      This patch adds a walk_memory_resource(start, len, arg, func) function.
      The function 'func' is called once per valid memory resource range in
      [start, start+len).
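      
      A hedged usage sketch (the callback signature is inferred from the
      description; names are illustrative):
      
          /* sketch: count pages across every valid RAM range in the span */
          static int count_ram(unsigned long start_pfn, unsigned long nr_pages,
                               void *arg)
          {
                  *(unsigned long *)arg += nr_pages;
                  return 0;       /* non-zero would abort the walk */
          }
          
          static unsigned long total_ram_pages(unsigned long start_pfn,
                                               unsigned long nr_pages)
          {
                  unsigned long total = 0;
          
                  walk_memory_resource(start_pfn, nr_pages, &total, count_ram);
                  return total;
          }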
      
      [pbadari@us.ibm.com: Error handling in walk_memory_resource()]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Group short-lived and reclaimable kernel allocations · e12ba74d
      Authored by Mel Gorman
      This patch marks a number of allocations that are either short-lived, such
      as network buffers, or reclaimable, such as inode allocations.  When
      something like updatedb runs, long-lived and unmovable kernel allocations
      tend to be spread throughout the address space, which increases
      fragmentation.
      
      This patch groups these allocations together as much as possible by adding
      a new migrate type.  The MIGRATE_RECLAIMABLE type is for allocations that
      can be reclaimed on demand, but not moved; i.e. they can be "migrated" by
      deleting them and re-reading the information from elsewhere.
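      
      A sketch of how allocation flags might map onto the new migrate type
      (function and flag names follow the patch description; treat the exact
      mapping as an assumption):
      
          /* sketch: route reclaimable allocations to their own migrate type */
          static inline int allocflags_to_migratetype(gfp_t gfp_flags)
          {
                  if (gfp_flags & __GFP_RECLAIMABLE)
                          return MIGRATE_RECLAIMABLE; /* reclaimable, pinned */
                  if (gfp_flags & __GFP_MOVABLE)
                          return MIGRATE_MOVABLE;
                  return MIGRATE_UNMOVABLE;           /* long-lived, pinned */
          }
      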
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memoryless nodes: Use N_HIGH_MEMORY for cpusets · 0e1e7c7a
      Authored by Christoph Lameter
      cpusets try to ensure that any node added to a cpuset's mems_allowed is
      online and contains memory.  The assumption was that online nodes contained
      memory.  Thus, it is possible to add memoryless nodes to a cpuset and then
      add tasks to this cpuset.  This results in a continuous series of oom-kills
      and an apparent system hang.
      
      Change cpusets to use node_states[N_HIGH_MEMORY] (a.k.a. node_memory_map)
      in place of node_online_map when vetting memories.  Return an error if the
      admin attempts to write a non-empty mems_allowed node mask containing only
      memoryless nodes.
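      
      A minimal sketch of the vetting change (the exact condition in the patch
      is assumed, not quoted):
      
          /* sketch: reject masks made up only of memoryless nodes */
          if (!nodes_empty(trialcs->mems_allowed) &&
              !nodes_intersects(trialcs->mems_allowed,
                                node_states[N_HIGH_MEMORY]))
                  return -EINVAL; /* no node in the mask has memory */
      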
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Bob Picco <bob.picco@hp.com>
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memoryless nodes: Allow profiling data to fall back to other nodes · 4199cfa0
      Authored by Christoph Lameter
      Processors on memoryless nodes must be able to fall back to remote nodes
      in order to get a profiling buffer.  This may lead to excessive NUMA
      traffic, but I think we should allow this rather than failing.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86: optimize page faults like all other architectures and kill notifier cruft · 74a0b576
      Authored by Christoph Hellwig
      x86(-64) are the last architectures still using the page fault notifier
      cruft for the kprobes page fault hook.  This patch converts them to proper
      direct calls, and removes the now-unused pagefault notifier bits as well
      as the cruft in kprobes.c that was related to this mess.
      
      I know Andi didn't really like this, but all the other architecture
      maintainers agreed that direct calls are much better; besides the obvious
      cruft removal, a common way of dealing with kprobes across architectures
      is important as well.
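      
      A hedged sketch of the direct call that replaces the notifier chain
      (close in spirit to the change; details may differ from the patch):
      
          /* sketch: invoke the kprobes fault handler directly */
          static inline int notify_page_fault(struct pt_regs *regs)
          {
                  int ret = 0;
          
                  /* kprobe_running() needs smp_processor_id(), so
                   * keep preemption off while we check */
                  if (!user_mode(regs)) {
                          preempt_disable();
                          if (kprobe_running() &&
                              kprobe_fault_handler(regs, 14))
                                  ret = 1;
                          preempt_enable();
                  }
                  return ret;
          }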
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: fix sparc64]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Andi Kleen <ak@suse.de>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Convert cpu_sibling_map to be a per cpu variable · d5a7430d
      Authored by Mike Travis
      Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
      variable.  This saves sizeof(cpumask_t) bytes for each unused CPU slot.
      Access is mostly from startup and CPU hotplug functions.
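      
      A sketch of the conversion (the DEFINE_PER_CPU/per_cpu API is standard;
      the surrounding code is assumed):
      
          /* before: static cpumask_t cpu_sibling_map[NR_CPUS]; */
          
          /* after (sketch): one cpumask instantiated per possible CPU */
          DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);
          
          /* accesses become per_cpu(cpu_sibling_map, cpu)
           * instead of cpu_sibling_map[cpu] */
      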
      Signed-off-by: Mike Travis <travis@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slow down printk during boot · bfe8df3d
      Authored by Randy Dunlap
      Optionally add a boot delay after each kernel printk() call, crudely
      measured in milliseconds, with a maximum delay of 10 seconds per printk.
      
      Enable CONFIG_BOOT_PRINTK_DELAY=y and then add (e.g.):
      "lpj=loops_per_jiffy boot_delay=100"
      to the kernel command line.
      
      It has been useful in cases like "during boot, my machine just reboots or
      the screen goes black".  By slowing down printk (and adding initcall_debug),
      we can usually see the last thing that happened before the lights went
      out, which is usually a valuable clue.
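      
      A hedged sketch of the delay itself (the helper name matches the
      feature's description; the lpj-based calibration is simplified):
      
          /* sketch: busy-wait roughly boot_delay ms after each printk */
          static void boot_delay_msec(void)
          {
                  unsigned long long loops;
          
                  if (boot_delay == 0 || system_state != SYSTEM_BOOTING)
                          return;
          
                  /* loops_per_msec is derived from the lpj= parameter */
                  loops = (unsigned long long)loops_per_msec * boot_delay;
                  while (loops--)
                          cpu_relax();    /* crude, but safe this early */
          }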
      
      [akpm@linux-foundation.org: not all architectures implement CONFIG_HZ]
      [akpm@linux-foundation.org: fix lots of stuff]
      [bunk@stusta.de: kernel/printk.c: make 2 variables static]
      [heiko.carstens@de.ibm.com: fix slow down printk on boot compile error]
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Signed-off-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Consolidate PTRACE_DETACH · 1bcf5482
      Authored by Alexey Dobriyan
      Identical handlers of PTRACE_DETACH go into ptrace_request().
      Not touching compat code.
      Not touching archs that don't call ptrace_request().
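      
      A minimal sketch of the consolidated case inside ptrace_request() (the
      surrounding switch statement is assumed):
      
          /* sketch: the shared handler, formerly duplicated per arch */
          case PTRACE_DETACH:
                  ret = ptrace_detach(child, data);
                  break;
      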
      Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 15 Oct, 2007 (25 commits)