1. 14 11月, 2005 2 次提交
  2. 11 11月, 2005 1 次提交
  3. 07 11月, 2005 1 次提交
  4. 30 10月, 2005 10 次提交
    • J
      [PATCH] mm: wider use of for_each_*cpu() · 2f96996d
      John Hawkes 提交于
      In 'mm' change the explicit use of a for-loop using NR_CPUS into the
      general for_each_cpu() constructs.  This widens the scope of potential
      future optimizations of the general constructs, as well as takes advantage
      of the existing optimizations of first_cpu() and next_cpu(), which is
      advantageous when the true CPU count is much smaller than NR_CPUS.
      Signed-off-by: NJohn Hawkes <hawkes@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2f96996d
    • D
      [PATCH] memory hotplug: sysfs and add/remove functions · 3947be19
      Dave Hansen 提交于
      This adds generic memory add/remove and supporting functions for memory
      hotplug into a new file as well as a memory hotplug kernel config option.
      
      Individual architecture patches will follow.
      
      For now, disable memory hotplug when swsusp is enabled.  There's a lot of
      churn there right now.  We'll fix it up properly once it calms down.
      Signed-off-by: NMatt Tolentino <matthew.e.tolentino@intel.com>
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3947be19
    • D
      [PATCH] memory hotplug locking: zone span seqlock · bdc8cb98
      Dave Hansen 提交于
      See the "fixup bad_range()" patch for more information, but this actually
      creates a the lock to protect things making assumptions about a zone's size
      staying constant at runtime.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bdc8cb98
    • D
      [PATCH] memory hotplug locking: node_size_lock · 208d54e5
      Dave Hansen 提交于
      pgdat->node_size_lock is basically only neeeded in one place in the normal
      code: show_mem(), which is the arch-specific sysrq-m printing function.
      
      Strictly speaking, the architectures not doing memory hotplug do no need this
      locking in show_mem().  However, they are all included for completeness.  This
      should also make any future consolidation of all of the implementations a
      little more straightforward.
      
      This lock is also held in the sparsemem code during a memory removal, as
      sections are invalidated.  This is the place there pfn_valid() is made false
      for a memory area that's being removed.  The lock is only required when doing
      pfn_valid() operations on memory which the user does not already have a
      reference on the page, such as in show_mem().
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      208d54e5
    • D
      [PATCH] memory hotplug prep: fixup bad_range() · c6a57e19
      Dave Hansen 提交于
      When doing memory hotplug operations, the size of existing zones can obviously
      change.  This means that zone->zone_{start_pfn,spanned_pages} can change.
      
      There are currently no locks that protect these structure members.  However,
      they are rarely accessed at runtime.  Outside of swsusp, the only place that I
      can find is bad_range().
      
      So, split bad_range() up into two pieces: one that needs to be locked and
      anther that doesn't.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c6a57e19
    • D
      [PATCH] memory hotplug prep: break out zone initialization · ed8ece2e
      Dave Hansen 提交于
      If a zone is empty at boot-time and then hot-added to later, it needs to run
      the same init code that would have been run on it at boot.
      
      This patch breaks out zone table and per-cpu-pages functions for use by the
      hotplug code.  You can almost see all of the free_area_init_core() function on
      one page now.  :)
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ed8ece2e
    • H
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins 提交于
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4c21e2f2
    • N
      [PATCH] core remove PageReserved · b5810039
      Nick Piggin 提交于
      Remove PageReserved() calls from core code by tightening VM_RESERVED
      handling in mm/ to cover PageReserved functionality.
      
      PageReserved special casing is removed from get_page and put_page.
      
      All setting and clearing of PageReserved is retained, and it is now flagged
      in the page_alloc checks to help ensure we don't introduce any refcount
      based freeing of Reserved pages.
      
      MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being
      deprecated.  We never completely handled it correctly anyway, and is be
      reintroduced in future if required (Hugh has a proof of concept).
      
      Once PageReserved() calls are removed from kernel/power/swsusp.c, and all
      arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can
      be trivially removed.
      
      Last real user of PageReserved is swsusp, which uses PageReserved to
      determine whether a struct page points to valid memory or not.  This still
      needs to be addressed (a generic page_is_ram() should work).
      
      A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and
      thus mapcounted and count towards shared rss).  These writes to the struct
      page could cause excessive cacheline bouncing on big systems.  There are a
      number of ways this could be addressed if it is an issue.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      
      Refcount bug fix for filemap_xip.c
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b5810039
    • S
      [PATCH] mm: set per-cpu-pages lower threshold to zero · e46a5e28
      Seth, Rohit 提交于
      Set the low water mark for hot pages in pcp to zero.
      
      (akpm: for the life of me I cannot remember why we created pcp->low.  Neither
      can Martin and the changelog is silent.  Maybe it was just a brainfart, but I
      have this feeling that there was a reason.  If not, we should remove the
      fields completely.  We'll see.)
      Signed-off-by: NRohit Seth <rohit.seth@intel.com>
      Cc: <linux-mm@kvack.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e46a5e28
    • S
      [PATCH] mm: page_alloc: increase size of per-cpu-pages · ba56e91c
      Seth, Rohit 提交于
      Increase the page allocator's per-cpu magazines from 1/4MB to 1/2MB.
      
      Over 100+ runs for a workload, the difference in mean is about 2%.  The best
      results for both are almost same.  Though the max variation in results with
      1/2MB is only 2.2%, whereas with 1/4MB it is 12%.
      Signed-off-by: NRohit Seth <rohit.seth@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ba56e91c
  5. 28 10月, 2005 2 次提交
  6. 27 10月, 2005 1 次提交
  7. 09 10月, 2005 1 次提交
  8. 13 9月, 2005 1 次提交
  9. 11 9月, 2005 1 次提交
  10. 08 9月, 2005 3 次提交
    • P
      [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
      Paul Jackson 提交于
      This patch makes use of the previously underutilized cpuset flag
      'mem_exclusive' to provide what amounts to another layer of memory placement
      resolution.  With this patch, there are now the following four layers of
      memory placement available:
      
       1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
       2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
       3) The current tasks cpuset (GFP_USER allocations constrained to here), and
       4) Specific node placement, using mbind and set_mempolicy.
      
      These nest - each layer is a subset (same or within) of the previous.
      
      Layer (2) above is new, with this patch.  The call used to check whether a
      zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
      extended to take a gfp_mask argument, and its logic is extended, in the case
      that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
      hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
      placement is allowed.  The definition of GFP_USER, which used to be identical
      to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
      cpuset_gfp_hardwall_flag patch.
      
      GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks
      cpuset, so long as any node therein is not too tight on memory, but will
      escape to the larger layer, if need be.
      
      The intended use is to allow something like a batch manager to handle several
      jobs, each job in its own cpuset, but using common kernel memory for caches
      and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
      task in or below one mem_exclusive cpuset should not cause swapping on nodes
      in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
      task in another such cpuset.  Heavy use of kernel memory for i/o caching and
      such by one job should not impact the memory available to jobs in other
      non-overlapping mem_exclusive cpusets.
      
      This patch enables providing hardwall, inescapable cpusets for memory
      allocations of each job, while sharing kernel memory allocations between
      several jobs, in an enclosing mem_exclusive cpuset.
      
      Like Dinakar's patch earlier to enable administering sched domains using the
      cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
      that had previously done nothing much useful other than restrict what cpuset
      configurations were allowed.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9bf2229f
    • R
      [PATCH] Additions to .data.read_mostly section · 6c231b7b
      Ravikiran G Thirumalai 提交于
      Mark variables which are usually accessed for reads with __readmostly.
      Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6c231b7b
    • C
      [PATCH] More __read_mostly variables · c3d8c141
      Christoph Lameter 提交于
      Move some more frequently read variables that showed up during some of our
      performance tests as sometimes ending up in hot cachelines to the
      read_mostly section.
      
      Fix: Move the __read_mostly from before hpet_usec_quotient to follow the
      variable like the other uses of __read_mostly.
      Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com>
      Signed-off-by: NChristoph Lameter <christoph@scalex86.org>
      Signed-off-by: NShai Fultheim <shai@scalex86.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c3d8c141
  11. 05 9月, 2005 3 次提交
  12. 31 7月, 2005 1 次提交
  13. 28 7月, 2005 1 次提交
    • A
      [PATCH] Remove bogus warning in page_alloc.c · 12b1c5f3
      Andy Whitcroft 提交于
      Originally __free_pages_bulk used the relative page number within a zone to
      define its buddies.  This meant that to maintain the "maximally aligned"
      requirements (that an allocation of size N will be aligned at least to N
      physically) zones had to also be aligned to 1<<MAX_ORDER pages.  When
      __free_pages_bulk was updated to use the relative page frame numbers of the
      free'd pages to pair buddies this released the alignment constraint on the
      'left' edge of the zone.  This allows _either_ edge of the zone to contain
      partial MAX_ORDER sized buddies.  These simply never will have matching
      buddies and thus will never make it to the 'top' of the pyramid.
      
      The patch below removes a now redundant check ensuring that the mem_map was
      aligned to MAX_ORDER.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <christoph@lameter.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      12b1c5f3
  14. 08 7月, 2005 2 次提交
  15. 28 6月, 2005 1 次提交
  16. 24 6月, 2005 6 次提交
    • A
      [PATCH] sparsemem hotplug base · 29751f69
      Andy Whitcroft 提交于
      Make sparse's initalization be accessible at runtime.  This allows sparse
      mappings to be created after boot in a hotplug situation.
      
      This patch is separated from the previous one just to give an indication how
      much of the sparse infrastructure is *just* for hotplug memory.
      
      The section_mem_map doesn't really store a pointer.  It stores something that
      is convenient to do some math against to get a pointer.  It isn't valid to
      just do *section_mem_map, so I don't think it should be stored as a pointer.
      
      There are a couple of things I'd like to store about a section.  First of all,
      the fact that it is !NULL does not mean that it is present.  There could be
      such a combination where section_mem_map *is* NULL, but the math gets you
      properly to a real mem_map.  So, I don't think that check is safe.
      
      Since we're storing 32-bit-aligned structures, we have a few bits in the
      bottom of the pointer to play with.  Use one bit to encode whether there's
      really a mem_map there, and the other one to tell whether there's a valid
      section there.  We need to distinguish between the two because sometimes
      there's a gap between when a section is discovered to be present and when we
      can get the mem_map for it.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NJack Steiner <steiner@sgi.com>
      Signed-off-by: NBob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      29751f69
    • A
      [PATCH] sparsemem swiss cheese numa layouts · 641c7673
      Andy Whitcroft 提交于
      The part of the sparsemem patch which modifies memmap_init_zone() has recently
      become a problem.  It changes behavior so that there is a call to
      pfn_to_page() for each individual page inside of a node's range:
      node_start_pfn through node_end_pfn.  It used to simply do this once, at the
      beginning of the node, but having sparsemem's non-contiguous mem_map[]s inside
      of a node made it necessary to change.
      
      Mike Kravetz recently wrote a patch which made the NUMA code accept some new
      kinds of layouts.  The system's memory was laid out like this, with node 0's
      memory in two pieces: one before and one after node 1's memory:
      
      	Node 0: +++++     +++++
      	Node 1:      +++++
      
      Previous behavior before Mike's patch was to assign nodes like this:
      
      	Node 0: 00000     XXXXX
      	Node 1:      11111
      
      Where the 'X' areas were simply thrown away.  The new behavior was to make the
      pg_data_t span node 0 across all of its areas, including areas that are really
      node 1's: Node 0: 000000000000000 Node 1: 11111
      
      This wastes a little bit of mem_map space, but ends up being OK, and more
      fully utilizes the system's memory.  memmap_init_zone() initializes all of the
      "struct page"s for node 0, even for the "hole", but those never get used,
      because there is no pfn_to_page() that resolves to those pages.  However, only
      calling pfn_to_page() once, memmap_init_zone() always uses the pages that were
      allocated for node0->node_mem_map because:
      
      	struct page *start = pfn_to_page(start_pfn);
      	// effectively start = &node->node_mem_map[0]
      	for (page = start; page < (start + size); page++) {
      		init_page_here();...
      		page++;
      	}
      
      Slow, and wasteful, but generally harmless.
      
      But, modify that to call pfn_to_page() for each loop iteration (like sparsemem
      does):
      
      	for (pfn = start_pfn; pfn < < (start_pfn + size); pfn++++) {
      		page = pfn_to_page(pfn);
      	}
      
      And you end up trying to initialize node 1's pages too early, along with bogus
      data from node 0.  This patch checks for those weird layouts and declines to
      touch the pages, making the more frequent pfn_to_page() calls OK to do.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      641c7673
    • A
      [PATCH] sparsemem memory model · d41dee36
      Andy Whitcroft 提交于
      Sparsemem abstracts the use of discontiguous mem_maps[].  This kind of
      mem_map[] is needed by discontiguous memory machines (like in the old
      CONFIG_DISCONTIGMEM case) as well as memory hotplug systems.  Sparsemem
      replaces DISCONTIGMEM when enabled, and it is hoped that it can eventually
      become a complete replacement.
      
      A significant advantage over DISCONTIGMEM is that it's completely separated
      from CONFIG_NUMA.  When producing this patch, it became apparent in that NUMA
      and DISCONTIG are often confused.
      
      Another advantage is that sparse doesn't require each NUMA node's ranges to be
      contiguous.  It can handle overlapping ranges between nodes with no problems,
      where DISCONTIGMEM currently throws away that memory.
      
      Sparsemem uses an array to provide different pfn_to_page() translations for
      each SECTION_SIZE area of physical memory.  This is what allows the mem_map[]
      to be chopped up.
      
      In order to do quick pfn_to_page() operations, the section number of the page
      is encoded in page->flags.  Part of the sparsemem infrastructure enables
      sharing of these bits more dynamically (at compile-time) between the
      page_zone() and sparsemem operations.  However, on 32-bit architectures, the
      number of bits is quite limited, and may require growing the size of the
      page->flags type in certain conditions.  Several things might force this to
      occur: a decrease in the SECTION_SIZE (if you want to hotplug smaller areas of
      memory), an increase in the physical address space, or an increase in the
      number of used page->flags.
      
      One thing to note is that, once sparsemem is present, the NUMA node
      information no longer needs to be stored in the page->flags.  It might provide
      speed increases on certain platforms and will be stored there if there is
      room.  But, if out of room, an alternate (theoretically slower) mechanism is
      used.
      
      This patch introduces CONFIG_FLATMEM.  It is used in almost all cases where
      there used to be an #ifndef DISCONTIG, because SPARSEMEM and DISCONTIGMEM
      often have to compile out the same areas of code.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NMartin Bligh <mbligh@aracnet.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NBob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d41dee36
    • D
      [PATCH] Introduce new Kconfig option for NUMA or DISCONTIG · 93b7504e
      Dave Hansen 提交于
      There is some confusion that arose when working on SPARSEMEM patch between
      what is needed for DISCONTIG vs. NUMA.
      
      Multiple pg_data_t's are needed for DISCONTIGMEM or NUMA, independently.
      All of the current NUMA implementations require an implementation of
      DISCONTIG.  Because of this, quite a lot of code which is really needed for
      NUMA is actually under DISCONTIG #ifdefs.  For SPARSEMEM, we changed some
      of these #ifdefs to CONFIG_NUMA, but that broke the DISCONTIG=y and NUMA=n
      case.
      
      Introducing this new NEED_MULTIPLE_NODES config option allows code that is
      needed for both NUMA or DISCONTIG to be separated out from code that is
      specific to DISCONTIG.
      
      One great advantage of this approach is that it doesn't require every
      architecture to be converted over.  All of the current implementations
      should "just work", only the ones implementing SPARSEMEM will have to be
      fixed up.
      
      The change to free_area_init() makes it work inside, or out of the new
      config option.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      93b7504e
    • D
      [PATCH] sparsemem base: reorganize page->flags bit operations · 348f8b6c
      Dave Hansen 提交于
      Generify the value fields in the page_flags.  The aim is to allow the location
      and size of these fields to be varied.  Additionally we want to move away from
      fixed allocations per field whilst still enforcing the overall bit utilisation
      limits.  We rely on the compiler to spot and optimise the accessor functions.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      348f8b6c
    • D
      [PATCH] sparsemem base: simple NUMA remap space allocator · 6f167ec7
      Dave Hansen 提交于
      Introduce a simple allocator for the NUMA remap space.  This space is very
      scarce, used for structures which are best allocated node local.
      
      This mechanism is also used on non-NUMA ia64 systems with a vmem_map to keep
      the pgdat->node_mem_map initialized in a consistent place for all
      architectures.
      
      Issues:
      o alloc_remap takes a node_id where we might expect a pgdat which was intended
        to allow us to allocate the pgdat's using this mechanism; which we do not yet
        do.  Could have alloc_remap_node() and alloc_remap_nid() for this purpose.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      6f167ec7
  17. 23 6月, 2005 1 次提交
  18. 22 6月, 2005 2 次提交