1. 10 May 2007 (2 commits)
    • Move remote node draining out of slab allocators · 4037d452
      Committed by Christoph Lameter
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that it is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add an expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
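      
      A minimal sketch of that logic in the per-cpu vm statistics refresh
      path (names such as p->expire and drain_zone_pages() follow the
      description above; the exact code is illustrative):
      
      	/* p is this cpu's struct per_cpu_pageset for a remote zone */
      	if (p->expire) {
      		/* A recent stats update re-armed the counter. */
      		p->expire--;
      		/* At zero, drain one batch back to the buddy allocator;
      		 * the resulting counter updates re-trigger expiration
      		 * until the pcp is empty. */
      		if (!p->expire && p->pcp[0].count)
      			drain_zone_pages(zone, p->pcp);
      	}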
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for large amounts of memory being
      caught in them.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Add suspend-related notifications for CPU hotplug · 8bb78442
      Committed by Rafael J. Wysocki
      Since nonboot CPUs are now disabled after tasks and devices have been
      frozen and the CPU hotplug infrastructure is used for this purpose, we need
      special CPU hotplug notifications that will help the CPU-hotplug-aware
      subsystems distinguish normal CPU hotplug events from CPU hotplug events
      related to a system-wide suspend or resume operation in progress.  This
      patch introduces such notifications and causes them to be used during
      suspend and resume transitions.  It also changes all of the
      CPU-hotplug-aware subsystems to take these notifications into consideration
      (for now they are handled in the same way as the corresponding "normal"
      ones).
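      
      A sketch of the event encoding and of how a hotplug-aware subsystem
      handles it (the _FROZEN names follow the description above; treat the
      details as illustrative):
      
      	#define CPU_TASKS_FROZEN	0x0010
      	#define CPU_ONLINE_FROZEN	(CPU_ONLINE | CPU_TASKS_FROZEN)
      	#define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)
      
      	static int my_cpu_callback(struct notifier_block *nb,
      				   unsigned long action, void *hcpu)
      	{
      		switch (action) {
      		case CPU_ONLINE:
      		case CPU_ONLINE_FROZEN:		/* same handling, for now */
      			/* set up per-cpu state */
      			break;
      		case CPU_DEAD:
      		case CPU_DEAD_FROZEN:
      			/* tear down per-cpu state */
      			break;
      		}
      		return NOTIFY_OK;
      	}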
      
      [oleg@tv-sign.ru: cleanups]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 May 2007 (2 commits)
  3. 08 May 2007 (6 commits)
  4. 02 Mar 2007 (1 commit)
  5. 21 Feb 2007 (2 commits)
  6. 12 Feb 2007 (8 commits)
    • [PATCH] optional ZONE_DMA: optional ZONE_DMA in the VM · 4b51d669
      Committed by Christoph Lameter
      Make ZONE_DMA optional in core code.
      
      - ifdef all code for ZONE_DMA and related definitions, following the example
        for ZONE_DMA32 and ZONE_HIGHMEM (see the sketch after this list).
      
      - Without ZONE_DMA, ZONE_HIGHMEM and ZONE_DMA32 we get to a ZONES_SHIFT of
        0.
      
      - Modify the VM statistics to work correctly without a DMA zone.
      
      - Modify slab to not create DMA slabs if there is no ZONE_DMA.
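      
      The pattern, mirroring what ZONE_DMA32 and ZONE_HIGHMEM already do
      (a sketch, not the exact diff):
      
      	static inline enum zone_type gfp_zone(gfp_t flags)
      	{
      	#ifdef CONFIG_ZONE_DMA
      		if (flags & __GFP_DMA)
      			return ZONE_DMA;
      	#endif
      	#ifdef CONFIG_ZONE_DMA32
      		if (flags & __GFP_DMA32)
      			return ZONE_DMA32;
      	#endif
      	#ifdef CONFIG_HIGHMEM
      		if (flags & __GFP_HIGHMEM)
      			return ZONE_HIGHMEM;
      	#endif
      		return ZONE_NORMAL;
      	}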
      
      [akpm@osdl.org: cleanup]
      [jdike@addtoit.com: build fix]
      [apw@shadowen.org: Simplify calculation of the number of bits we need for ZONES_SHIFT]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone · 6267276f
      Committed by Christoph Lameter
      This patchset follows up on the earlier work in Andrew's tree to reduce the
      number of zones.  The patches allow going down to a minimum of 2 zones.  This
      one also makes ZONE_DMA optional, so the number of zones can be reduced
      to one.
      
      ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
      why we would not want to have ZONE_DMA:
      
      1. Some arches do not need ZONE_DMA at all.
      
      2. With the advent of IOMMUs, DMA zones are no longer needed.
         The need for DMA zones may be drastically reduced
         in the future. This patchset allows compiling a
         kernel without that overhead.
      
      3. Devices that require ISA DMA are rare these days. None
         of my systems has any need for ISA DMA.
      
      4. The presence of an additional zone unnecessarily complicates
         VM operations because it must be scanned and the balancing
         logic must operate on it.
      
      5. With only ZONE_NORMAL one can reach the situation where
         we have only one zone. This will allow the unrolling of many
         loops in the VM and allows the optimization of various
         code paths in the VM.
      
      6. Having only a single zone in a NUMA system results in a
         1-1 correspondence between nodes and zones. Various additional
         optimizations to critical VM paths become possible.
      
      Many systems today can operate just fine with a single zone.  If you look at
      what is in ZONE_DMA, you usually see that nothing uses it.  The DMA slabs
      are empty (some arches use ZONE_DMA instead of ZONE_NORMAL; then ZONE_NORMAL
      is empty instead).
      
      On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
      constantly look at an empty zone in /proc/zoneinfo and an empty slab in
      /proc/slabinfo?  Non-i386 arches also frequently have no need for ZONE_DMA,
      and the zones stay empty.
      
      The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).
      
      The RFC posted earlier (see
      http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
      of #ifdefs in it.  An effort has been made to minimize the number of #ifdefs
      and make this as compact as possible.  The job was made much easier by the
      ongoing efforts of others to extract common arch specific functionality.
      
      I have been running this for a while now on my desktop, and finally Linux is
      using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:
      
      christoph@pentium940:~$ cat /proc/zoneinfo
      Node 0, zone   Normal
        pages free     4435
              min      1448
              low      1810
              high     2172
              active   241786
              inactive 210170
              scanned  0 (a: 0 i: 0)
              spanned  524224
              present  524224
          nr_anon_pages 61680
          nr_mapped    14271
          nr_file_pages 390264
          nr_slab_reclaimable 27564
          nr_slab_unreclaimable 1793
          nr_page_table_pages 449
          nr_dirty     39
          nr_writeback 0
          nr_unstable  0
          nr_bounce    0
          cpu: 0 pcp: 0
                    count: 156
                    high:  186
                    batch: 31
          cpu: 0 pcp: 1
                    count: 9
                    high:  62
                    batch: 15
        vm stats threshold: 20
          cpu: 1 pcp: 0
                    count: 177
                    high:  186
                    batch: 31
          cpu: 1 pcp: 1
                    count: 12
                    high:  62
                    batch: 15
        vm stats threshold: 20
        all_unreclaimable: 0
        prev_priority:     12
        temp_priority:     12
        start_pfn:         0
      
      This patch:
      
      In two places in the VM we use ZONE_DMA to refer to the first zone.  If
      ZONE_DMA is optional then other zones may be first.  So simply replace
      ZONE_DMA with zone 0.
      
      This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
      ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
      number related to a pgdat.  However, we still need a zonetable to index all
      the zones for each node if this is a NUMA system.  Therefore define
      ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.
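      
      Roughly, with the 2.6.20-era page->flags layout (macro names follow
      that era's include/linux/mm.h; the exact bit arithmetic in the patch
      may differ):
      
      	#define ZONETABLE_SHIFT		(NODES_SHIFT + ZONES_SHIFT)
      	#define ZONETABLE_PGSHIFT	ZONES_PGSHIFT
      	#define ZONETABLE_MASK		((1UL << ZONETABLE_SHIFT) - 1)
      
      	static inline struct zone *page_zone(struct page *page)
      	{
      		return zone_table[(page->flags >> ZONETABLE_PGSHIFT) &
      				  ZONETABLE_MASK];
      	}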
      
      [apw@shadowen.org: fix mismerge]
      Acked-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Drop get_zone_counts() · 65e458d4
      Committed by Christoph Lameter
      Values are available via ZVC sums.
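      
      For example (illustrative; NR_ACTIVE, NR_INACTIVE and NR_FREE_PAGES
      are the counters introduced elsewhere in this series):
      
      	unsigned long active   = global_page_state(NR_ACTIVE);
      	unsigned long inactive = global_page_state(NR_INACTIVE);
      	unsigned long free     = global_page_state(NR_FREE_PAGES);
      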
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Drop nr_free_pages_pgdat() · 9195481d
      Committed by Christoph Lameter
      Function is unnecessary now.  We can use the summing features of the ZVCs to
      get the values we need.
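      
      For example, the per-node free page count is just a ZVC read (sketch):
      
      	unsigned long node_free = node_page_state(pgdat->node_id,
      						  NR_FREE_PAGES);
      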
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Drop free_pages() · 96177299
      Committed by Christoph Lameter
      nr_free_pages is now a simple access to a global variable.  Make it a macro
      instead of a function.
      
      nr_free_pages() now requires vmstat.h to be included.  There is one
      occurrence in power management where we need to add the include.  Directly
      refer to global_page_state() there to clarify why the #include was added.
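      
      The macro then amounts to (as described above; sketch):
      
      	#define nr_free_pages() global_page_state(NR_FREE_PAGES)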
      
      [akpm@osdl.org: arm build fix]
      [akpm@osdl.org: sparc64 build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Use ZVC for free_pages · d23ad423
      Committed by Christoph Lameter
      This again simplifies some of the VM counter calculations through the use
      of the ZVC consolidated counters.
      
      [michal.k.k.piotrowski@gmail.com: build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Use ZVC for inactive and active counts · c8785385
      Committed by Christoph Lameter
      The dirty ratio used to determine writeback behavior is currently based on
      the total number of pages in the system.
      
      However, not all pages in the system may be dirtied.  Thus the ratio is always
      too low and can never reach 100%.  The ratio may be particularly skewed if
      large hugepage allocations, slab allocations or device driver buffers make
      large sections of memory unavailable.  In that case we may get into
      a situation in which e.g. the background writeback ratio of 40% can no
      longer be reached, which leads to undesired writeback behavior.
      
      This patchset fixes that issue by determining the ratio based on the actual
      pages that may potentially be dirty.  These are the pages on the active and
      the inactive list plus free pages.
      
      The problem with these counts has so far been that they are expensive to
      calculate, because counts from multiple nodes and multiple zones have to
      be summed up.  This patchset makes these counters ZVC counters.  This
      means that a current sum per zone, per node and for the whole system is always
      available via global variables and not expensive anymore to calculate.
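      
      The dirtyable-memory calculation can then be a handful of cheap
      counter reads, along these lines (the helper name is illustrative):
      
      	static unsigned long dirtyable_memory(void)
      	{
      		/* pages that could potentially be dirtied */
      		return global_page_state(NR_FREE_PAGES) +
      		       global_page_state(NR_ACTIVE) +
      		       global_page_state(NR_INACTIVE);
      	}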
      
      The patchset results in some other good side effects:
      
      - Removal of the various functions that sum up free, active and inactive
        page counts
      
      - Cleanup of the functions that display information via the proc filesystem.
      
      This patch:
      
      The use of a ZVC for nr_inactive and nr_active allows a simplification of some
      counter operations.  More ZVC functionality is used for sums etc. in the
      following patches.
      
      [akpm@osdl.org: UP build fix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] Avoid excessive sorting of early_node_map[] · a6af2bc3
      Committed by Mel Gorman
      find_min_pfn_for_node() and find_min_pfn_with_active_regions() sort
      early_node_map[] on every call.  This is an excessive amount of sorting that
      can be avoided.  This patch always searches the whole early_node_map[]
      in find_min_pfn_for_node() instead of returning the first value found.  The
      map is then only sorted once when required.  Successfully boot tested on a
      number of machines.
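      
      The approach, roughly (a sketch, not the exact upstream hunk):
      
      	unsigned long find_min_pfn_for_node(int nid)
      	{
      		unsigned long min_pfn = ULONG_MAX;
      		int i;
      
      		/* Search the whole map instead of relying on it being
      		 * sorted and returning the first match. */
      		for (i = 0; i < nr_nodemap_entries; i++)
      			if (early_node_map[i].nid == nid)
      				min_pfn = min(min_pfn,
      					      early_node_map[i].start_pfn);
      		return min_pfn;
      	}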
      
      [akpm@osdl.org: cleanup]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 10 Feb 2007 (1 commit)
  8. 01 Feb 2007 (1 commit)
  9. 12 Jan 2007 (1 commit)
  10. 06 Jan 2007 (2 commits)
    • [PATCH] Check for populated zone in __drain_pages · f2e12bb2
      Committed by Christoph Lameter
      Both process_zones() and drain_node_pages() check for populated zones
      before touching pagesets.  However, __drain_pages() does not, which may
      result in a NULL pointer dereference for pagesets in unpopulated zones
      if a NUMA setup is combined with cpu hotplug.
      
      Initially the unpopulated zone has the pcp pointers pointing to the boot
      pagesets.  Since the zone is not populated the boot pageset pointers will
      not be changed during page allocator and slab bootstrap.
      
      If a cpu is later brought down (first call to __drain_pages()) then the pcp
      pointers for cpus in unpopulated zones are set to NULL since __drain_pages
      does not first check for an unpopulated zone.
      
      If the cpu is then brought up again then we call process_zones() which will
      ignore the unpopulated zone.  So the pageset pointers will still be NULL.
      
      If the cpu is then again brought down then __drain_pages will attempt to
      drain pages by following the NULL pageset pointer for unpopulated zones.
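      
      The fix is the same populated-zone guard the other two paths already
      use (sketch):
      
      	for_each_zone(zone) {
      		if (!populated_zone(zone))
      			continue;
      		/* ... free this cpu's pcp pages for the zone ... */
      	}
      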
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Sanely size hash tables when using large base pages · 9ab37b8f
      Committed by Paul Mundt
      At the moment the inode/dentry cache hash tables (common by way of
      alloc_large_system_hash()) are incorrectly sized by their respective
      detection logic when we attempt to use large base pages on systems with
      little memory.
      
      This results in odd behaviour when using a 64kB PAGE_SIZE, such as:
      
      Dentry cache hash table entries: 8192 (order: -1, 32768 bytes)
      Inode-cache hash table entries: 4096 (order: -2, 16384 bytes)
      
      The mount cache hash table is seemingly the only one that gets this right
      by directly taking PAGE_SIZE into account.
      
      The following patch attempts to catch the bogus values and round them up
      to at least order 0.
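      
      A sketch of the clamp in alloc_large_system_hash() (illustrative):
      
      	/* Make sure we've got at least a 0-order allocation. */
      	if (unlikely((numentries * bucketsize) < PAGE_SIZE))
      		numentries = PAGE_SIZE / bucketsize;
      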
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  11. 14 Dec 2006 (1 commit)
    • [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Committed by Paul Jackson
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occasion.
      
      The hardwall version requires that the current task's mems_allowed allows
      the node of the specified zone (or that you're in interrupt, that
      __GFP_THISNODE is set, or that you're on a one-cpuset system).
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclosing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate).
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
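      
      The resulting caller pattern (sketch):
      
      	/* atomic context, or hardwall semantics wanted explicitly */
      	if (cpuset_zone_allowed_hardwall(zone, gfp_mask))
      		/* ... use the zone ... */;
      
      	/* may sleep; checks the nearest mem_exclusive cpuset */
      	if (cpuset_zone_allowed_softwall(zone, gfp_mask))
      		/* ... use the zone ... */;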
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but it seems worth it to reduce the chance of hitting a
      sleeping-with-irqs-off complaint again.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  12. 09 Dec 2006 (3 commits)
    • [PATCH] fault-injection: defaults likely to please a new user · 6b1b60f4
      Committed by Don Mullis
      Assign defaults most likely to please a new user:
       1) generate some logging output
          (verbose=2)
       2) avoid injecting failures likely to lock up UI
          (ignore_gfp_wait=1, ignore_gfp_highmem=1)
      Signed-off-by: Don Mullis <dwm@meer.net>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] fault-injection capability for alloc_pages() · 933e312e
      Committed by Akinobu Mita
      This patch provides fault-injection capability for alloc_pages().
      
      Boot option:
      
      fail_page_alloc=<interval>,<probability>,<space>,<times>
      
      	<interval> -- specifies the interval of failures.
      
      	<probability> -- specifies how often it should fail in percent.
      
      	<space> -- specifies the size of free space where memory can be
      		   allocated safely in pages.
      
      	<times> -- specifies how many times failures may happen at most.
      
      Debugfs:
      
      /debug/fail_page_alloc/interval
      /debug/fail_page_alloc/probability
      /debug/fail_page_alloc/space
      /debug/fail_page_alloc/times
      /debug/fail_page_alloc/ignore-gfp-highmem
      /debug/fail_page_alloc/ignore-gfp-wait
      
      Example:
      
      	fail_page_alloc=10,100,0,-1
      
      The page allocation (alloc_pages(), ...) fails once per 10 times.
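      
      Inside the page allocator the hook amounts to (sketch; the hook name
      follows the fault-injection convention):
      
      	if (should_fail_alloc_page(gfp_mask, order))
      		return NULL;
      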
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] LOG2: Implement a general integer log2 facility in the kernel · f0d1b0b3
      Committed by David Howells
      This facility provides three entry points:
      
      	ilog2()		Log base 2 of unsigned long
      	ilog2_u32()	Log base 2 of u32
      	ilog2_u64()	Log base 2 of u64
      
      These facilities can either be used inside functions on dynamic data:
      
      	int do_something(long q)
      	{
      		...;
      		y = ilog2(x);
      		...;
      	}
      
      Or can be used to statically initialise global variables with constant values:
      
      	unsigned n = ilog2(27);
      
      When performing static initialisation, the compiler will report "error:
      initializer element is not constant" if asked to take a log of zero or of
      something not reducible to a constant.  They treat negative numbers as
      unsigned.
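      
      For instance (the failing case follows from the rule above):
      
      	unsigned ok  = ilog2(4096);	/* == 12, folded at compile time */
      	unsigned bad = ilog2(0);	/* error: initializer element is
      					 * not constant */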
      
      When not dealing with a constant, they fall back to using fls() which permits
      them to use arch-specific log calculation instructions - such as BSR on
      x86/x86_64 or SCAN on FRV - if available.
      
      [akpm@osdl.org: MMC fix]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Wojtek Kaniewski <wojtekka@toxygen.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 08 Dec 2006 (10 commits)