1. 08 Jul 2008 (4 commits)
  2. 13 Jun 2008 (1 commit)
  3. 10 Jun 2008 (1 commit)
    • mm, x86: shrink_active_range() should check all · cc1a9d86
      Authored by Yinghai Lu
      Now that we use register_e820_active_regions() instead of calling
      add_active_range() directly, the end_pfn recorded in early_node_map
      can differ from node_end_pfn.
      
      So we need to make shrink_active_range() smarter.
      
      shrink_active_range() is a generic MM function in mm/page_alloc.c but
      it is only used on 32-bit x86. Should we move it back to some file in
      arch/x86?
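
      A minimal sketch of the smarter check (simplified: the real function
      must also drop or split entries; early_node_map[] and
      nr_nodemap_entries are the names used in mm/page_alloc.c):

      	void __init shrink_active_range(unsigned int nid,
      					unsigned long new_end_pfn)
      	{
      		int i;

      		/* Check every registered range for this node, not just one. */
      		for (i = 0; i < nr_nodemap_entries; i++) {
      			if (early_node_map[i].nid != nid)
      				continue;
      			if (early_node_map[i].end_pfn > new_end_pfn)
      				early_node_map[i].end_pfn = new_end_pfn;
      		}
      	}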
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 29 Apr 2008 (2 commits)
  5. 28 Apr 2008 (8 commits)
    • mempolicy: document {set|get}_policy() vm_ops APIs · a6020ed7
      Authored by Lee Schermerhorn
      Document the return-value reference semantics that the rest of the
      mempolicy code assumes for the set_policy and get_policy vm_ops, in
      <linux/mm.h> where the prototypes are defined, to inform any future
      mempolicy vm_op writers what the rest of the subsystem expects of them.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: add vm_insert_mixed · 423bad60
      Authored by Nick Piggin
      vm_insert_mixed will insert either a raw pfn or a refcounted struct page
      into the page tables, depending on whether vm_normal_page() will return
      the page or not.  With the introduction of the new pte bit, this is now
      too tricky for drivers to be doing themselves.
      
      filemap_xip uses this in a subsequent patch.
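
      A hedged usage sketch of what this enables in a driver ->fault
      handler (mydrv_addr_to_pfn() is a hypothetical helper; the vma must
      have VM_MIXEDMAP set):

      	static int mydrv_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
      	{
      		/* Hypothetical device-specific address-to-pfn translation. */
      		unsigned long pfn = mydrv_addr_to_pfn(vma, vmf->virtual_address);

      		/* The core refcounts the page only if vm_normal_page()
      		 * would return it for this pte. */
      		if (vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn))
      			return VM_FAULT_SIGBUS;
      		return VM_FAULT_NOPAGE;
      	}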
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce pte_special pte bit · 7e675137
      Authored by Nick Piggin
      s390, for one, cannot implement VM_MIXEDMAP with pfn_valid, due to its
      memory model (which is more dynamic than most).  Instead, they had
      proposed to implement it with an additional path through
      vm_normal_page(), using a bit in the pte to determine whether or not
      the page should be refcounted:
      
      vm_normal_page()
      {
      	...
      	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
      		if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
      			if (!mixedmap_refcount_pte(pte))
      				return NULL;
      #else
      			if (!pfn_valid(pfn))
      				return NULL;
      #endif
      			goto out;
      		}
      	...
      }
      
      This is fine; however, if we are allowed to use a bit in the pte to
      determine refcountedness, we can use that to _completely_ replace all
      the vma-based schemes.  So instead of adding more cases to the already
      complex vma-based scheme, we can have a clearly separate and simple
      pte-based scheme (and get slightly better code generation in the
      process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we would rather make this concept usable by any
      architecture than make it s390-only, so implement a new type of pte
      state for it.  Unfortunately the old vma-based code must stay, because
      some architectures may not be able to spare pte bits.  This makes
      vm_normal_page a little bit more ugly than we would like, but the two
      cases are clearly separate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
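
      As a sketch, the noop accessors look like this (each architecture
      carries its own copies in asm/pgtable.h):

      	static inline int pte_special(pte_t pte)
      	{
      		return 0;	/* no spare pte bit: vma-based scheme applies */
      	}

      	static inline pte_t pte_mkspecial(pte_t pte)
      	{
      		return pte;	/* noop until an arch dedicates a bit */
      	}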
      
      BTW:
      I haven't put vm_normal_page() into arch code, as per an earlier
      suggestion.  The reason is that, regardless of where vm_normal_page is
      actually implemented, the *abstraction* is still exactly the same. Also,
      while its behaviour depends on whether the architecture has pte_special
      or not, those are the only two possible cases, and it really isn't an
      arch-specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce VM_MIXEDMAP · b379d790
      Authored by Jared Hulbert
      This series introduces some important infrastructure work.  The overall result
      is that:
      
      1. We now support XIP-backed filesystems using memory that has no
         struct page allocated to it.  Patches 6 and 7 actually implement
         this for s390.
      
         This is pretty important in a number of cases. As far as I understand,
         in the case of virtualisation (e.g. s390), each guest may mount a
         readonly copy of the same filesystem (e.g. the distro). Currently,
         guests need to allocate struct pages for this image. So if you have
         100 guests, you already need to allocate more memory for the struct
         pages than the size of the image. I think. (Carsten?)
      
         For other (e.g. embedded) systems, you may have a very large non-
         volatile filesystem. If you have to have struct pages for this, then
         your RAM consumption will go up in proportion to fs size. Even
         though it is just a small proportion, the RAM can be much more costly,
         e.g. in terms of power, so every KB that Linux saves makes it more
         attractive to a lot of these guys.
      
      2. VM_MIXEDMAP allows us to support mappings where you actually do want
         to refcount _some_ pages in the mapping, but not others, and support
         COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
         filesystem in progress. Future iterations of this filesystem will
         most likely want to migrate pages between pagecache and XIP backing,
         which is where the requirement for mixed (some refcounted, some not)
         comes from.
      
      3. pte_special also has a peripheral usage that I need for my lockless
         get_user_pages patch. That was shown to speed up "oltp" on db2 by
         10% on a 2 socket system, which is kind of significant because they
         scrounge for months to try to find 0.1% improvement on these
         workloads. I'm hoping we might finally be faster than AIX on
         pSeries with this :). My reference to lockless get_user_pages is not
         meant to justify this patchset (which doesn't include lockless gup),
         but just to show that pte_special is not some s390 specific thing that
         should be hidden in arch code or xip code: I definitely want to use it
         on at least x86 and powerpc as well.
      
      This patch:
      
      Introduce a new type of mapping, VM_MIXEDMAP.  This is unlike VM_PFNMAP
      in that it can support COW mappings of arbitrary ranges, including
      ranges without struct page *and* ranges with a struct page that we
      actually want to refcount (PFNMAP can only support COW in those cases
      where the un-COW-ed translations are mapped linearly in the virtual
      address space, and can only support non-refcounted ranges).
      
      VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
      refcounting !pfn_valid pages (which is not an option for VM_PFNMAP,
      because it needs to avoid refcounting pfn_valid pages, e.g. for
      /dev/mem mappings).
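
      The resulting decision inside vm_normal_page(), as a sketch mirroring
      the rule above:

      	if (unlikely(vma->vm_flags & VM_MIXEDMAP)) {
      		if (!pfn_valid(pfn))
      			return NULL;	/* raw pfn: not refcounted */
      		goto out;		/* has a struct page: refcount it */
      	}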
      Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • pageflags: get rid of FLAGS_RESERVED · 9223b419
      Authored by Christoph Lameter
      NR_PAGEFLAGS specifies the number of page flags we are using.  From
      that we can calculate the number of bits left over that can be used for
      the zone, node (and maybe the section) ids.  There is no longer any
      need for FLAGS_RESERVED if we use NR_PAGEFLAGS.
      
      Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
      NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
      These field widths have to be available to the preprocessor.
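
      A sketch of the kind of boundary check this enables (width macros as
      in the page->flags layout code):

      	/* Is there room for a node id alongside the flags? */
      	#if SECTIONS_WIDTH + ZONES_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
      	#define NODES_WIDTH	NODES_SHIFT
      	#else
      	#define NODES_WIDTH	0	/* node id must be derived elsewhere */
      	#endif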
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • page_mapping(): add ifdef around reference to swapper_space · 726b8012
      Authored by Andrew Morton
      This fixes the SuperH build when the pageflags patches are applied.
      
      But it shouldn't be needed, unless it's a gcc bug.
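
      The shape of the fix, as a simplified sketch (anon-page handling
      omitted; swapper_space only exists under CONFIG_SWAP):

      	static inline struct address_space *page_mapping(struct page *page)
      	{
      		struct address_space *mapping = page->mapping;

      #ifdef CONFIG_SWAP
      		if (unlikely(PageSwapCache(page)))
      			mapping = &swapper_space;
      #endif
      		return mapping;
      	}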
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sparsemem: vmemmap does not need section bits · 308c05e3
      Authored by Christoph Lameter
      A set of patches that attempts to improve page flag handling.  First of
      all, a method is introduced to generate the page flag functions using
      macros.  Then the number of page flags used by sparsemem is reduced.
      Page flag operations are no longer macros; all flags use inline
      functions.
      
      Then we add a way to export enum constants to the preprocessor, which
      allows us to get rid of __ZONE_COUNT and use NR_PAGEFLAGS for the
      dynamic calculation of the page flags actually available for fields.
      
      This patch:
      
      Sparsemem vmemmap does not need any section bits.  This patch has the effect
      of reducing the number of bits used in page->flags by at least 6.
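
      A sketch of the resulting width selection:

      	#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
      	#define SECTIONS_WIDTH	SECTIONS_SHIFT	/* classic sparsemem needs page_to_section() */
      	#else
      	#define SECTIONS_WIDTH	0	/* vmemmap derives the pfn from the map address */
      	#endif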
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove nopage · 3c18ddd1
      Authored by Nick Piggin
      Nothing in the tree uses nopage any more.  Remove support for it in the
      core mm code and documentation (and a few stray references to it in
      comments).
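
      What remains for faulting afterwards, as a sketch (unrelated ops
      elided):

      	struct vm_operations_struct {
      		void (*open)(struct vm_area_struct *area);
      		void (*close)(struct vm_area_struct *area);
      		/* ->nopage is gone; ->fault is the single entry point */
      		int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
      		/* ... */
      	};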
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 Apr 2008 (1 commit)
    • x86_64/mm: check and print vmemmap allocation continuous · c2b91e2e
      Authored by Yinghai Lu
      On big systems with lots of memory, don't print out too much during
      bootup, and make it easy to see whether the allocation is contiguous.
      
      On a 256G, 8-socket system we get:
       [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
      [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
       [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
      [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
       [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
       [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
       [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
      [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
       [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
      [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
       [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
       [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
       [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
      [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
       [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
      [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
       [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
       [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
       [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
      [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
       [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
       [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
      
      instead of a very long printout...
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. 13 Mar 2008 (1 commit)
    • nommu: Provide is_vmalloc_addr() stub. · 0738c4bb
      Authored by Paul Mundt
      Introduced in commit 9e2779fa and ifdef'ed out for nommu in 8ca3ed87;
      both approaches end up breaking the nommu build in different ways.  An
      impressive feat for a 2-liner.
      
      Current is_vmalloc_addr() users fall into two camps:
      
      	- Determining whether to use vfree()/kfree()
      	- Whether to do vmlist traversal (only /proc/kcore).
      
      Since we don't support /proc/kcore on nommu, that leaves the
      vfree()/kfree() determination use cases.  nommu vfree() happens to be a
      wrapper around kfree() anyway, so is_vmalloc_addr() can always return 0
      and end up with the right behaviour.
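
      The stub itself, as a sketch:

      	#ifndef CONFIG_MMU
      	static inline int is_vmalloc_addr(const void *x)
      	{
      		return 0;	/* vfree() wraps kfree() on nommu */
      	}
      	#endif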
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 24 Feb 2008 (1 commit)
  9. 21 Feb 2008 (1 commit)
  10. 14 Feb 2008 (1 commit)
  11. 09 Feb 2008 (1 commit)
    • CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Authored by Martin Schwidefsky
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That
      means the s390 version of pte_alloc_one cannot return a pointer to a
      struct page.  Trouble is that with the CONFIG_HIGHPTE feature on x86,
      pte_alloc_one cannot return a pointer to a pte either, since that would
      require more than 32 bits for the return value of pte_alloc_one (and
      the pte * would not be accessible since it's not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
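
      A sketch of the default (non-s390) shape described above:

      	typedef struct page *pgtable_t;

      	static inline void pgtable_page_ctor(struct page *page)
      	{
      		pte_lock_init(page);			/* initialize the ptl */
      		inc_zone_page_state(page, NR_PAGETABLE);
      	}

      	static inline void pgtable_page_dtor(struct page *page)
      	{
      		pte_lock_deinit(page);
      		dec_zone_page_state(page, NR_PAGETABLE);
      	}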
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 06 Feb 2008 (5 commits)
  13. 30 Jan 2008 (2 commits)
  14. 25 Jan 2008 (1 commit)
  15. 05 Dec 2007 (1 commit)
    • Security: round mmap hint address above mmap_min_addr · 7cd94146
      Authored by Eric Paris
      If mmap_min_addr is set and a process attempts to mmap (not fixed) with
      a non-null hint address less than mmap_min_addr, the mapping will fail
      the security checks.  Since this is just a hint address, this patch
      rounds such a hint address up above mmap_min_addr.
      
      gcj was found to try to be very frugal with vm usage and give hint addresses
      in the 8k-32k range.  Without this patch all such programs failed and with
      the patch they happily get a higher address.
      
      This patch is wrapped in CONFIG_SECURITY, since mmap_min_addr doesn't
      exist without it and no security check would be possible no matter
      what.  So we should not bother compiling in this rounding when it is
      just a waste of time.
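
      A sketch of the rounding helper, guarded by CONFIG_SECURITY as
      described:

      	#ifdef CONFIG_SECURITY
      	static inline unsigned long round_hint_to_min(unsigned long hint)
      	{
      		hint &= PAGE_MASK;
      		/* A non-null hint below the floor is bumped above it. */
      		if (hint != 0 && hint < mmap_min_addr)
      			return PAGE_ALIGN(mmap_min_addr);
      		return hint;
      	}
      	#endif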
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
  16. 17 Oct 2007 (9 commits)
    • Drop some headers from mm.h · 4af3c9cc
      Authored by Alexey Dobriyan
      mm.h doesn't directly use anything from mutex.h or backing-dev.h, so
      remove them and add them back to the files which need them.
      
      Cross-compile tested on many configs and archs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/shmem.c: make 3 functions static · d8dc74f2
      Authored by Adrian Bunk
      This patch makes three needlessly global functions static.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory hotplug: Hot-add with sparsemem-vmemmap · 98f3cfc1
      Authored by Yasunori Goto
      This patch avoids a panic when memory hot-add is executed with
      sparsemem-vmemmap.  The current sparsemem-vmemmap code doesn't support
      memory hot-add: the vmemmap must be populated on hot-add (a sketch of
      the population path follows the todo list below).  This is against
      2.6.23-rc2-mm2.
      
      Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential
              offnode page_structs" is still displayed.  To allocate the
              memmap on its own node, the memmap (and pgdat) must first be
              initialized itself: a chicken-and-egg relationship.
      
            # vmemmap_unpopulate will be necessary for the following:
               - cancelling a hot-add after an error.
               - unplug.
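
      A sketch of the hot-add population path this enables (signatures as
      in the sparsemem-vmemmap code of this era):

      	struct page * __meminit sparse_mem_map_populate(unsigned long pnum, int nid)
      	{
      		struct page *map = pfn_to_page(pnum * PAGES_PER_SECTION);

      		/* The virtual memmap must be populated before the
      		 * section's struct pages can be touched. */
      		if (vmemmap_populate(map, PAGES_PER_SECTION, nid))
      			return NULL;
      		return map;
      	}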
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • SLUB: Do not use page->mapping · 8e65d24c
      Authored by Christoph Lameter
      After moving the lockless_freelist to kmem_cache_cpu we no longer need
      page->lockless_freelist.  Restructure the use of the struct page fields
      in such a way that we never touch the mapping field.
      
      This in turn allows us to remove the special casing of SLUB when
      determining the mapping of a page (needed for corner cases on virtually
      cached machines, which need to flush the caches of processors mapping a
      page).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • move mm_struct and vm_area_struct · c92ff1bd
      Authored by Martin Schwidefsky
      Move the definitions of struct mm_struct and struct vm_area_struct to
      include/linux/mm_types.h.  This allows us to define more functions in
      asm/pgtable.h and friends with inline assemblies instead of macros.
      Compile tested on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.
      
      [aurelien@aurel32.net: build fix]
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • remove ZERO_PAGE · 557ed1fa
      Authored by Nick Piggin
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- measuring on my desktop system, there are never
      many mappings to the ZERO_PAGE (e.g. 2 or 3), thus memory usage here
      should not increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
    • readahead: remove several readahead macros · 535443f5
      Authored by Fengguang Wu
      Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmemmap: generify initialisation via helpers · 29c71111
      Authored by Andy Whitcroft
      Convert the common vmemmap population code into initialisation helpers
      for use by architecture vmemmap populators.  All architectures
      implementing the SPARSEMEM_VMEMMAP variant supply an
      architecture-specific vmemmap_populate() initialiser, which may make
      use of the helpers.
      
      This allows us to clean up and remove the initialisation Kconfig entries.
      With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
      indicate use of that variant.
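
      As a sketch, an architecture with no special mapping needs can build
      its populator entirely from the helpers (vmemmap_populate_basepages is
      the generic base-page helper):

      	int __meminit vmemmap_populate(struct page *start_page,
      				       unsigned long size, int node)
      	{
      		/* Map the memmap with base pages via the generic helpers. */
      		return vmemmap_populate_basepages(start_page, size, node);
      	}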
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Generic Virtual Memmap support for SPARSEMEM · 8f6aac41
      Authored by Christoph Lameter
      SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
      the arches.  It would be great if it could be the default so that we can get
      rid of various forms of DISCONTIG and other variations on memory maps.  So far
      what has hindered this are the additional lookups that SPARSEMEM introduces
      for virt_to_page and page_address.  This goes so far that the code to do this
      has to be kept in a separate function and cannot be used inline.
      
      This patch introduces a virtual memmap mode for SPARSEMEM, in which the
      memmap is mapped into a virtually contiguous area; only the active
      sections are physically backed.  This allows virt_to_page, page_address
      and cohorts to become simple shift/add operations.  No page flag fields,
      no table lookups, nothing involving memory is required.
      
      The two key operations pfn_to_page and page_to_pfn become:
      
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))
         #define __page_to_pfn(page)     ((page) - vmemmap)
      
      By having a virtual mapping for the memmap we allow simple access
      without wasting physical memory.  As kernel memory is typically already
      mapped 1:1 this introduces no additional overhead.  The virtual mapping
      must be big enough to allow a struct page to be allocated and mapped
      for all valid physical pages.  This will make a virtual memmap
      difficult to use on 32-bit platforms that support 36 address bits.
      
      However, if there is enough virtual space available and the arch already maps
      its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
      technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
      FLATMEM needs to read the contents of the mem_map variable to get the start of
      the memmap and then add the offset to the required entry.  vmemmap is a
      constant to which we can simply add the offset.
      
      This patch has the potential to allow us to make SPARSEMEM the default
      (and even the only) option for most systems.  It should be optimal on
      UP, SMP and NUMA on most platforms.  Then we may even be able to remove
      the other memory models: FLATMEM, DISCONTIG etc.
      
      [apw@shadowen.org: config cleanups, resplit code etc]
      [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
      [apw@shadowen.org: vmemmap: remove excess debugging]
      [apw@shadowen.org: simplify initialisation code and reduce duplication]
      [apw@shadowen.org: pull out the vmemmap code into its own file]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>