1. 28 April 2008 (6 commits)
    • mm: introduce pte_special pte bit · 7e675137
      Committed by Nick Piggin
      s390, for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
      model (which is more dynamic than most).  Instead, they had proposed to
      implement it with an additional path through vm_normal_page(), using a bit in
      the pte to determine whether or not the page should be refcounted:
      
      vm_normal_page()
      {
      	...
              if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                      if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
                              if (!mixedmap_refcount_pte(pte))
                                      return NULL;
      #else
                              if (!pfn_valid(pfn))
                                      return NULL;
      #endif
                              goto out;
                      }
              ...
      }
      
      This is fine; however, if we are allowed to use a bit in the pte to determine
      refcountedness, we can use that to _completely_ replace all the vma-based
      schemes.  So instead of adding more cases to the already complex vma-based
      scheme, we can have a clearly separate and simple pte-based scheme (and get
      slightly better code generation in the process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we would rather make this concept usable by any architecture
      than make it s390-only, so implement a new type of pte state for this.
      Unfortunately the old vma-based code must stay, because some architectures may
      not be able to spare pte bits.  This makes vm_normal_page a little bit more
      ugly than we would like, but the two cases are clearly separate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
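      
      As a rough illustration, the pte-based scheme this enables might boil down
      to the following sketch (purely illustrative; this commit itself only adds
      the pte_special() accessor as a no-op):
      
      struct page *vm_normal_page(struct vm_area_struct *vma,
                                  unsigned long addr, pte_t pte)
      {
              if (pte_special(pte))
                      return NULL;            /* not a refcounted "normal" page */
              return pfn_to_page(pte_pfn(pte));
      }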
      
      BTW:
      I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
      The reason is that, regardless of where vm_normal_page is actually
      implemented, the *abstraction* is still exactly the same. Also, while it
      depends on whether the architecture has pte_special or not, those are the
      only two possible cases, and it really isn't an arch-specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e675137
    • mm: introduce VM_MIXEDMAP · b379d790
      Committed by Jared Hulbert
      This series introduces some important infrastructure work.  The overall result
      is that:
      
      1. We now support XIP-backed filesystems using memory that has no
         struct page allocated to it. And patches 6 and 7 actually implement
         this for s390.
      
         This is pretty important in a number of cases. As far as I understand,
         in the case of virtualisation (eg. s390), each guest may mount a
         readonly copy of the same filesystem (eg. the distro). Currently,
         guests need to allocate struct pages for this image. So if you have
         100 guests, you already need to allocate more memory for the struct
         pages than the size of the image. I think. (Carsten?)
      
         For other (eg. embedded) systems, you may have a very large non-
         volatile filesystem. If you have to have struct pages for this, then
         your RAM consumption will go up proportionally to fs size. Even
         though it is just a small proportion, the RAM can be much more costly
         eg in terms of power, so every KB less that Linux uses makes it more
         attractive to a lot of these guys.
      
      2. VM_MIXEDMAP allows us to support mappings where you actually do want
         to refcount _some_ pages in the mapping, but not others, and support
         COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
         filesystem in progress. Future iterations of this filesystem will
         most likely want to migrate pages between pagecache and XIP backing,
         which is where the requirement for mixed (some refcounted, some not)
         comes from.
      
      3. pte_special also has a peripheral usage that I need for my lockless
         get_user_pages patch. That was shown to speed up "oltp" on db2 by
         10% on a 2 socket system, which is kind of significant because they
         scrounge for months to try to find 0.1% improvement on these
         workloads. I'm hoping we might finally be faster than AIX on
         pSeries with this :). My reference to lockless get_user_pages is not
         meant to justify this patchset (which doesn't include lockless gup),
         but just to show that pte_special is not some s390 specific thing that
         should be hidden in arch code or xip code: I definitely want to use it
         on at least x86 and powerpc as well.
      
      This patch:
      
      Introduce a new type of mapping, VM_MIXEDMAP.  This is unlike VM_PFNMAP in
      that it can support COW mappings of arbitrary ranges including ranges without
      struct page *and* ranges with a struct page that we actually want to refcount
      (PFNMAP can only support COW in those cases where the un-COW-ed translations
      are mapped linearly in the virtual address space, and can only support
      non-refcounted ranges).
      
      VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
      refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
      needs to avoid refcounting pfn_valid pages eg.  for /dev/mem mappings).
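      
      A rough sketch of the resulting distinction, in the spirit of
      vm_normal_page() (illustrative only, not the exact mm/memory.c code):
      
      if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
              if (vma->vm_flags & VM_MIXEDMAP) {
                      if (!pfn_valid(pfn))
                              return NULL;    /* no struct page: do not refcount */
              } else {                        /* VM_PFNMAP */
                      return NULL;            /* never refcounted */
              }
      }
      return pfn_to_page(pfn);                /* normal, refcounted page */
      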
      Signed-off-by: Jared Hulbert <jaredeh@gmail.com>
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b379d790
    • pageflags: get rid of FLAGS_RESERVED · 9223b419
      Committed by Christoph Lameter
      NR_PAGEFLAGS specifies the number of page flags we are using.  From that we
      can calculate the number of bits leftover that can be used for zone, node (and
      maybe the sections id).  There is no need anymore for FLAGS_RESERVED if we use
      NR_PAGEFLAGS.
      
      Use the new methods to make NR_PAGEFLAGS available via the preprocessor.
      NR_PAGEFLAGS is used to calculate field boundaries in the page flags fields.
      These field widths have to be available to the preprocessor.
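      
      For orientation, a sketch of how such field boundaries can be derived from
      NR_PAGEFLAGS in the preprocessor (in the spirit of mm.h, not the exact macros):
      
      #define SECTIONS_PGOFF  (BITS_PER_LONG - SECTIONS_WIDTH)
      #define NODES_PGOFF     (SECTIONS_PGOFF - NODES_WIDTH)
      #define ZONES_PGOFF     (NODES_PGOFF - ZONES_WIDTH)
      
      #if SECTIONS_WIDTH + NODES_WIDTH + ZONES_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
      #error "Not enough room for zone/node/section fields in page->flags"
      #endif
      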
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9223b419
    • page_mapping(): add ifdef around reference to swapper_space · 726b8012
      Committed by Andrew Morton
      This fixes the superh build when the pageflags patches are applied.
      
      But it shouldn't unless it's a gcc bug.
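      
      For context, the kind of guard this adds looks roughly like the following
      (a sketch; the exact code in the header may differ):
      
      static inline struct address_space *page_mapping(struct page *page)
      {
              struct address_space *mapping = page->mapping;
      
      #ifdef CONFIG_SWAP      /* swapper_space only exists with CONFIG_SWAP */
              if (unlikely(PageSwapCache(page)))
                      mapping = &swapper_space;
              else
      #endif
              if (unlikely((unsigned long)mapping & PAGE_MAPPING_ANON))
                      mapping = NULL;
              return mapping;
      }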
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      726b8012
    • sparsemem: vmemmap does not need section bits · 308c05e3
      Committed by Christoph Lameter
      A set of patches that attempts to improve page flag handling.  First of all a
      method is introduced to generate the page flag functions using macros.  Then
      the number of page flags used by sparsemem is reduced.  Page flag operations
      will no longer be macros; all flags will use inline functions.
      
      Then we add a way to export enum constants to the preprocessor, which allows us
      to get rid of __ZONE_COUNT and use NR_PAGEFLAGS for the dynamic calculation
      of the page flag bits actually available for fields.
      
      This patch:
      
      Sparsemem vmemmap does not need any section bits.  This patch has the effect
      of reducing the number of bits used in page->flags by at least 6.
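      
      A sketch of the effect (illustrative, not the exact mm.h logic): with vmemmap
      the section can be derived from a page's virtual address, so page->flags
      needs no section bits at all.
      
      #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
      #define SECTIONS_WIDTH          SECTIONS_SHIFT
      #else
      #define SECTIONS_WIDTH          0
      #endif
      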
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      308c05e3
    • mm: remove nopage · 3c18ddd1
      Committed by Nick Piggin
      Nothing in the tree uses nopage any more.  Remove support for it in the
      core mm code and documentation (and a few stray references to it in
      comments).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c18ddd1
  2. 27 April 2008 (1 commit)
    • x86_64/mm: check and print vmemmap allocation continuous · c2b91e2e
      Committed by Yinghai Lu
      On big systems with lots of memory, don't print out too much during
      bootup, and make it easy to see whether the vmemmap allocation is contiguous.
      
      On a 256G, 8-socket system we will get:
       [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
      [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
       [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
      [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
       [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
       [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
       [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
      [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
       [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
      [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
       [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
       [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
       [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
      [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
       [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
      [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
       [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
       [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
       [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
      [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
       [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
       [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
      
      instead of a very long printout...
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c2b91e2e
  3. 13 March 2008 (1 commit)
    • nommu: Provide is_vmalloc_addr() stub. · 0738c4bb
      Committed by Paul Mundt
      is_vmalloc_addr() was introduced in commit 9e2779fa and
      ifdef'ed out for nommu in 8ca3ed87; both
      approaches end up breaking the nommu build in different ways. An
      impressive feat for a 2-liner.
      
      Current is_vmalloc_addr() users fall into two camps:
      
      	- Determining whether to use vfree()/kfree()
      	- Whether to do vmlist traversal (only /proc/kcore).
      
      Since we don't support /proc/kcore on nommu, that leaves the
      vfree()/kfree() determination use cases. nommu vfree() happens to be a
      wrapper around kfree() anyway, so is_vmalloc_addr() can always return 0
      and end up with the right behaviour.
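      
      A sketch of the stub described above (illustrative): on nommu, vmalloc() is
      backed by kmalloc(), so nothing ever needs the vmalloc treatment.
      
      #ifndef CONFIG_MMU
      static inline int is_vmalloc_addr(const void *addr)
      {
              return 0;
      }
      #endif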
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0738c4bb
  4. 24 February 2008 (1 commit)
  5. 21 February 2008 (1 commit)
  6. 14 February 2008 (1 commit)
  7. 09 February 2008 (1 commit)
    • CONFIG_HIGHPTE vs. sub-page page tables. · 2f569afd
      Committed by Martin Schwidefsky
      Background: I've implemented 1K/2K page tables for s390.  These sub-page
      page tables are required to properly support the s390 virtualization
      instruction with KVM.  The SIE instruction requires that the page tables
      have 256 page table entries (pte) followed by 256 page status table entries
      (pgste).  The pgstes are only required if the process is using the SIE
      instruction.  The pgstes are updated by the hardware and by the hypervisor
      for a number of reasons, one of them is dirty and reference bit tracking.
      To avoid wasting memory the standard pte table allocation should return
      1K/2K (31/64 bit) and 2K/4K if the process is using SIE.
      
      Problem: Page size on s390 is 4K, page table size is 1K or 2K.  That means
      the s390 version for pte_alloc_one cannot return a pointer to a struct
      page.  Trouble is that with the CONFIG_HIGHPTE feature on x86, pte_alloc_one
      cannot return a pointer to a pte either, since that would require more than
      32 bits for the return value of pte_alloc_one (and the pte * would not be
      accessible anyway, since it's not kmapped).
      
      Solution: The only solution I found to this dilemma is a new typedef: a
      pgtable_t.  For s390 pgtable_t will be a (pte *) - to be introduced with a
      later patch.  For everybody else it will be a (struct page *).  The
      additional problem with the initialization of the ptl lock and the
      NR_PAGETABLE accounting is solved with a constructor pgtable_page_ctor and
      a destructor pgtable_page_dtor.  The page table allocation and free
      functions need to call these two whenever a page table page is allocated or
      freed.  pmd_populate will get a pgtable_t instead of a struct page pointer.
       To get the pgtable_t back from a pmd entry that has been installed with
      pmd_populate a new function pmd_pgtable is added.  It replaces the pmd_page
      call in free_pte_range and apply_to_pte_range.
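      
      A sketch of how an architecture whose pgtable_t stays (struct page *) might
      use the new hooks (illustrative, not any particular arch's actual code):
      
      static pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr)
      {
              struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
      
              if (!page)
                      return NULL;
              pgtable_page_ctor(page);        /* ptl init + NR_PAGETABLE accounting */
              return page;
      }
      
      static void pte_free(struct mm_struct *mm, pgtable_t page)
      {
              pgtable_page_dtor(page);
              __free_page(page);
      }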
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f569afd
  8. 06 February 2008 (5 commits)
  9. 30 January 2008 (2 commits)
  10. 25 January 2008 (1 commit)
  11. 05 December 2007 (1 commit)
    • Security: round mmap hint address above mmap_min_addr · 7cd94146
      Committed by Eric Paris
      If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
      non-null hint address less than mmap_min_addr the mapping will fail the
      security checks.  Since this is just a hint address this patch will round
      such a hint address above mmap_min_addr.
      
      gcj was found to be very frugal with vm usage and to give hint addresses
      in the 8k-32k range.  Without this patch all such programs failed; with
      the patch they happily get a higher address.
      
      This patch is wrapped in CONFIG_SECURITY since mmap_min_addr doesn't exist
      without it and no security check would be possible no matter what, so we
      should not bother compiling in this rounding when it would just be wasted
      work.
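      
      A sketch of the rounding described above (close to, but not necessarily
      identical to, the helper added by this commit):
      
      #ifdef CONFIG_SECURITY
      static inline unsigned long round_hint_to_min(unsigned long hint)
      {
              hint &= PAGE_MASK;
              if (hint != 0 && hint < mmap_min_addr)
                      return PAGE_ALIGN(mmap_min_addr);
              return hint;
      }
      #endif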
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      7cd94146
  12. 17 October 2007 (9 commits)
    • Drop some headers from mm.h · 4af3c9cc
      Committed by Alexey Dobriyan
      mm.h doesn't directly use anything from mutex.h and backing-dev.h, so
      remove them and add them back to the files which need them.
      
      Cross-compile tested on many configs and archs.
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4af3c9cc
    • mm/shmem.c: make 3 functions static · d8dc74f2
      Committed by Adrian Bunk
      This patch makes three needlessly global functions static.
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8dc74f2
    • memory hotplug: Hot-add with sparsemem-vmemmap · 98f3cfc1
      Committed by Yasunori Goto
      This patch avoids a panic when memory hot-add is executed with
      sparsemem-vmemmap.  The current vmemmap-sparsemem code doesn't support memory
      hot-add; the vmemmap must be populated on hot-add.  This is for
      2.6.23-rc2-mm2.
      
      Todo: # Even if this patch is applied, the message "[xxxx-xxxx] potential
              offnode page_structs" is still displayed. To allocate the memmap on
              its own node, the memmap (and pgdat) would have to initialize itself,
              which is a chicken-and-egg relationship.
      
            # vmemmap_unpopulate will be necessary for the following:
               - cancelling a hot-add due to an error
               - unplug
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98f3cfc1
    • SLUB: Do not use page->mapping · 8e65d24c
      Committed by Christoph Lameter
      After moving the lockless_freelist to kmem_cache_cpu we no longer need
      page->lockless_freelist. Restructure the use of the struct page fields in
      such a way that we never touch the mapping field.
      
      This in turn allows us to remove the special casing of SLUB when determining
      the mapping of a page (needed for corner cases on machines with virtual caches
      that need to flush the caches of processors mapping a page).
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e65d24c
    • move mm_struct and vm_area_struct · c92ff1bd
      Committed by Martin Schwidefsky
      Move the definitions of struct mm_struct and struct vm_area_struct to
      include/linux/mm_types.h.  This allows more functions in asm/pgtable.h
      and friends to be defined with inline assemblies instead of macros.
      Compile tested on i386, powerpc, powerpc64, s390-32, s390-64 and x86_64.
      
      [aurelien@aurel32.net: build fix]
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c92ff1bd
    • remove ZERO_PAGE · 557ed1fa
      Committed by Nick Piggin
      The commit b5810039 contains the note
      
        A last caveat: the ZERO_PAGE is now refcounted and managed with rmap
        (and thus mapcounted and count towards shared rss).  These writes to
        the struct page could cause excessive cacheline bouncing on big
        systems.  There are a number of ways this could be addressed if it is
        an issue.
      
      And indeed this cacheline bouncing has shown up on large SGI systems.
      There was a situation where an Altix system was essentially livelocked
      tearing down ZERO_PAGE pagetables when an HPC app aborted during startup.
      This situation can be avoided in userspace, but it does highlight the
      potential scalability problem with refcounting ZERO_PAGE, and corner
      cases where it can really hurt (we don't want the system to livelock!).
      
      There are several broad ways to fix this problem:
      1. add back some special casing to avoid refcounting ZERO_PAGE
      2. per-node or per-cpu ZERO_PAGES
      3. remove the ZERO_PAGE completely
      
      I will argue for 3. The others should also fix the problem, but they
      result in more complex code than does 3, with little or no real benefit
      that I can see.
      
      Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a
      false optimisation: if an application is performance critical, it would
      not be doing many read faults of new memory, or at least it could be
      expected to write to that memory soon afterwards. If cache or memory use
      is critical, it should not be working with a significant number of
      ZERO_PAGEs anyway (a more compact representation of zeroes should be
      used).
      
      As a sanity check -- measuring on my desktop system, there are never many
      mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not
      increase much without it.
      
      When running a make -j4 kernel compile on my dual core system, there are
      about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000
      ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second
      is torn down without being COWed). So removing ZERO_PAGE will save 1,000
      page faults per second when running kbuild, while keeping it only saves
      less than 1 page clearing operation per second. 1 page clear is cheaper
      than a thousand faults, presumably, so there isn't an obvious loss.
      
      Neither the logical argument nor these basic tests give a guarantee of no
      regressions. However, this is a reasonable opportunity to try to remove
      the ZERO_PAGE from the pagefault path. If it is found to cause regressions,
      we can reintroduce it and just avoid refcounting it.
      
      The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked.  I don't see
      much use to them except on benchmarks.  All other users of ZERO_PAGE are
      converted just to use ZERO_PAGE(0) for simplicity. We can look at
      replacing them all and maybe ripping out ZERO_PAGE completely when we are
      more satisfied with this solution.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
      557ed1fa
    • readahead: remove several readahead macros · 535443f5
      Committed by Fengguang Wu
      Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      535443f5
    • vmemmap: generify initialisation via helpers · 29c71111
      Committed by Andy Whitcroft
      Convert the common vmemmap population into initialisation helpers for use by
      architecture vmemmap populators.  All architectures implementing the
      SPARSEMEM_VMEMMAP variant supply an architecture-specific vmemmap_populate()
      initialiser, which may make use of the helpers.
      
      This allows us to clean up and remove the initialisation Kconfig entries.
      With this patch there is a single SPARSEMEM_VMEMMAP_ENABLE Kconfig option to
      indicate use of that variant.
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29c71111
    • Generic Virtual Memmap support for SPARSEMEM · 8f6aac41
      Committed by Christoph Lameter
      SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
      the arches.  It would be great if it could be the default so that we can get
      rid of various forms of DISCONTIG and other variations on memory maps.  So far
      what has hindered this are the additional lookups that SPARSEMEM introduces
      for virt_to_page and page_address.  This goes so far that the code to do this
      has to be kept in a separate function and cannot be used inline.
      
      This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
      is mapped into a virtually contiguous area; only the active sections are
      physically backed.  This allows virt_to_page, page_address and friends to
      become simple shift/add operations.  No page flag fields, no table lookups,
      nothing involving memory is required.
      
      The two key operations pfn_to_page and page_to_pfn become:
      
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))
         #define __page_to_pfn(page)     ((page) - vmemmap)
      
      By having a virtual mapping for the memmap we allow simple access without
      wasting physical memory.  As kernel memory is typically already mapped 1:1
      this introduces no additional overhead.  The virtual mapping must be big
      enough to allow a struct page to be allocated and mapped for all valid
      physical pages.  This will make a virtual memmap difficult to use on 32 bit
      platforms that support 36 address bits.
      
      However, if there is enough virtual space available and the arch already maps
      its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
      technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
      FLATMEM needs to read the contents of the mem_map variable to get the start of
      the memmap and then add the offset to the required entry.  vmemmap is a
      constant to which we can simply add the offset.
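      
      An illustrative comparison (simplified, not the exact per-arch macros):
      
         /* FLATMEM variant: must first load the mem_map pointer from memory */
         #define __pfn_to_page(pfn)      (mem_map + ((pfn) - ARCH_PFN_OFFSET))
      
         /* SPARSEMEM_VMEMMAP variant: vmemmap is a compile-time constant address */
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))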
      
      This patch has the potential to allow us to make SPARSEMEM the default (and
      even the only) option for most systems.  It should be optimal on UP, SMP and
      NUMA on most platforms.  Then we may even be able to remove the other memory
      models: FLATMEM, DISCONTIG etc.
      
      [apw@shadowen.org: config cleanups, resplit code etc]
      [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
      [apw@shadowen.org: vmemmap: remove excess debugging]
      [apw@shadowen.org: simplify initialisation code and reduce duplication]
      [apw@shadowen.org: pull out the vmemmap code into its own file]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f6aac41
  13. 23 August 2007 (1 commit)
    • fix NULL pointer dereference in __vm_enough_memory() · 34b4e4aa
      Committed by Alan Cox
      The new exec code inserts an accounted vma into an mm struct which is not
      current->mm.  The existing memory check code has a hard-coded assumption
      that this does not happen, as does the security code.
      
      As the correct mm is known we pass the mm to the security method and the
      helper function.  A new security test is added for the case where we need
      to pass the mm and the existing one is modified to pass current->mm to
      avoid the need to change large amounts of code.
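      
      A sketch of the interface change described above (signatures and the wrapper
      relationship simplified for illustration):
      
      int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
      
      /* new hook for callers whose target mm is not current->mm */
      int security_vm_enough_memory_mm(struct mm_struct *mm, long pages);
      
      /* existing callers keep passing current->mm */
      static inline int security_vm_enough_memory(long pages)
      {
              return security_vm_enough_memory_mm(current->mm, pages);
      }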
      
      (Thanks to Tobias for fixing rejects and testing)
      Signed-off-by: Alan Cox <alan@redhat.com>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Cc: James Morris <jmorris@redhat.com>
      Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34b4e4aa
  14. 30 July 2007 (1 commit)
    • Remove fs.h from mm.h · 4e950f6f
      Committed by Alexey Dobriyan
      Remove fs.h from mm.h. For this,
       1) Uninline vma_wants_writenotify(). It's pretty huge anyway.
       2) Add back fs.h or less bloated headers (err.h) to files that need it.
      
      As a result, on x86_64 allyesconfig, the number of files rebuilt due to fs.h
      dependencies drops from 3929 to 3444 (-12.3%).
      
      Cross-compile tested without regressions on my two usual configs and (sigh):
      
      alpha              arm-mx1ads        mips-bigsur          powerpc-ebony
      alpha-allnoconfig  arm-neponset      mips-capcella        powerpc-g5
      alpha-defconfig    arm-netwinder     mips-cobalt          powerpc-holly
      alpha-up           arm-netx          mips-db1000          powerpc-iseries
      arm                arm-ns9xxx        mips-db1100          powerpc-linkstation
      arm-assabet        arm-omap_h2_1610  mips-db1200          powerpc-lite5200
      arm-at91rm9200dk   arm-onearm        mips-db1500          powerpc-maple
      arm-at91rm9200ek   arm-picotux200    mips-db1550          powerpc-mpc7448_hpc2
      arm-at91sam9260ek  arm-pleb          mips-ddb5477         powerpc-mpc8272_ads
      arm-at91sam9261ek  arm-pnx4008       mips-decstation      powerpc-mpc8313_rdb
      arm-at91sam9263ek  arm-pxa255-idp    mips-e55             powerpc-mpc832x_mds
      arm-at91sam9rlek   arm-realview      mips-emma2rh         powerpc-mpc832x_rdb
      arm-ateb9200       arm-realview-smp  mips-excite          powerpc-mpc834x_itx
      arm-badge4         arm-rpc           mips-fulong          powerpc-mpc834x_itxgp
      arm-carmeva        arm-s3c2410       mips-ip22            powerpc-mpc834x_mds
      arm-cerfcube       arm-shannon       mips-ip27            powerpc-mpc836x_mds
      arm-clps7500       arm-shark         mips-ip32            powerpc-mpc8540_ads
      arm-collie         arm-simpad        mips-jazz            powerpc-mpc8544_ds
      arm-corgi          arm-spitz         mips-jmr3927         powerpc-mpc8560_ads
      arm-csb337         arm-trizeps4      mips-malta           powerpc-mpc8568mds
      arm-csb637         arm-versatile     mips-mipssim         powerpc-mpc85xx_cds
      arm-ebsa110        i386              mips-mpc30x          powerpc-mpc8641_hpcn
      arm-edb7211        i386-allnoconfig  mips-msp71xx         powerpc-mpc866_ads
      arm-em_x270        i386-defconfig    mips-ocelot          powerpc-mpc885_ads
      arm-ep93xx         i386-up           mips-pb1100          powerpc-pasemi
      arm-footbridge     ia64              mips-pb1500          powerpc-pmac32
      arm-fortunet       ia64-allnoconfig  mips-pb1550          powerpc-ppc64
      arm-h3600          ia64-bigsur       mips-pnx8550-jbs     powerpc-prpmc2800
      arm-h7201          ia64-defconfig    mips-pnx8550-stb810  powerpc-ps3
      arm-h7202          ia64-gensparse    mips-qemu            powerpc-pseries
      arm-hackkit        ia64-sim          mips-rbhma4200       powerpc-up
      arm-integrator     ia64-sn2          mips-rbhma4500       s390
      arm-iop13xx        ia64-tiger        mips-rm200           s390-allnoconfig
      arm-iop32x         ia64-up           mips-sb1250-swarm    s390-defconfig
      arm-iop33x         ia64-zx1          mips-sead            s390-up
      arm-ixp2000        m68k              mips-tb0219          sparc
      arm-ixp23xx        m68k-amiga        mips-tb0226          sparc-allnoconfig
      arm-ixp4xx         m68k-apollo       mips-tb0287          sparc-defconfig
      arm-jornada720     m68k-atari        mips-workpad         sparc-up
      arm-kafa           m68k-bvme6000     mips-wrppmc          sparc64
      arm-kb9202         m68k-hp300        mips-yosemite        sparc64-allnoconfig
      arm-ks8695         m68k-mac          parisc               sparc64-defconfig
      arm-lart           m68k-mvme147      parisc-allnoconfig   sparc64-up
      arm-lpd270         m68k-mvme16x      parisc-defconfig     um-x86_64
      arm-lpd7a400       m68k-q40          parisc-up            x86_64
      arm-lpd7a404       m68k-sun3         powerpc              x86_64-allnoconfig
      arm-lubbock        m68k-sun3x        powerpc-cell         x86_64-defconfig
      arm-lusl7200       mips              powerpc-celleb       x86_64-up
      arm-mainstone      mips-atlas        powerpc-chrp32
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e950f6f
  15. 27 July 2007 (1 commit)
    • fix 'dynreloc miscount' link error on Powerpc · 045e72ac
      Committed by Sam Ravnborg
      Nathan Lynch <ntl@pobox.com> reported:
      2.6.23-rc1 breaks the build for 64-bit powerpc for me (using
      maple_defconfig):
      
        LD      vmlinux.o
      powerpc64-unknown-linux-gnu-ld: dynreloc miscount for
      kernel/built-in.o, section .opd
      powerpc64-unknown-linux-gnu-ld: can not edit opd Bad value
      make: *** [vmlinux.o] Error 1
      
      However, I see a possibly related binutils patch:
      http://article.gmane.org/gmane.comp.gnu.binutils/33650
      
      It was tracked down to be caused by the weak prototype
      declaration in mm.h:
      __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma);
      
      But there is no need to make the declaration weak - only the definition
      needs to be marked weak.  So drop the weak declaration.  And in the process
      drop the duplicate definition in page.h for powerpc.
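      
      Illustrative sketch of the distinction: the weak attribute belongs on the
      definition, not on the declaration.
      
      /* header (mm.h): a plain declaration */
      const char *arch_vma_name(struct vm_area_struct *vma);
      
      /* generic fallback definition, overridable by an arch-specific one */
      __attribute__((weak)) const char *arch_vma_name(struct vm_area_struct *vma)
      {
              return NULL;
      }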
      
      Note: the arch_vma_name fix for x86_64 needs to be applied first to avoid
      breaking x86_64
      Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Nathan Lynch <ntl@pobox.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      045e72ac
  16. 20 July 2007 (7 commits)
    • mm: variable length argument support · b6a2fea3
      Committed by Ollie Wild
      Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
      the old mm into the new mm.
      
      We create the new mm before the binfmt code runs, and place the new stack at
      the very top of the address space.  Once the binfmt code runs and figures out
      where the stack should be, we move it downwards.
      
      It is a bit peculiar in that we have one task with two mm's, one of which is
      inactive.
      
      [a.p.zijlstra@chello.nl: limit stack size]
      Signed-off-by: Ollie Wild <aaw@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      [bunk@stusta.de: unexport bprm_mm_init]
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6a2fea3
    • readahead: split ondemand readahead interface into two functions · cf914a7d
      Committed by Rusty Russell
      Split ondemand readahead interface into two functions.  I think this makes it
      a little clearer for non-readahead experts (like Rusty).
      
      Internally they both call ondemand_readahead(), but the page argument is
      changed to an obvious boolean flag.
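      
      The resulting pair of entry points might be sketched as follows (prototypes
      assumed from the readahead API of this era, shown for orientation only):
      
      void page_cache_sync_readahead(struct address_space *mapping,
                                     struct file_ra_state *ra, struct file *filp,
                                     pgoff_t offset, unsigned long req_size);
      
      void page_cache_async_readahead(struct address_space *mapping,
                                      struct file_ra_state *ra, struct file *filp,
                                      struct page *page, pgoff_t offset,
                                      unsigned long req_size);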
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf914a7d
    • readahead: remove the old algorithm · c743d96b
      Committed by Fengguang Wu
      Remove the old readahead algorithm.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c743d96b
    • readahead: on-demand readahead logic · 122a21d1
      Committed by Fengguang Wu
      This is a minimal readahead algorithm that aims to replace the current one.
      It is more flexible and reliable, while maintaining almost the same behavior
      and performance.  It is also fully integrated with adaptive readahead.
      
      It is designed to be called on demand:
      	- on a missing page, to do synchronous readahead
      	- on a lookahead page, to do asynchronous readahead
      
      In this way it eliminates the awkward workarounds for cache hit/miss,
      readahead thrashing, retried reads, and unaligned reads.  It also adopts the
      data structure introduced by adaptive readahead, parameterizes readahead
      pipelining with `lookahead_index', and reduces the current/ahead windows to
      one single window.
      
      HEURISTICS
      
      The logic deals with four cases:
      
      	- sequential-next
      		found a consistent readahead window, so push it forward
      
      	- random
      		standalone small read, so read as is
      
      	- sequential-first
      		create a new readahead window for a sequential/oversize request
      
      	- lookahead-clueless
      		hit a lookahead page not associated with the readahead window,
      		so create a new readahead window and ramp it up
      
      In each case, three parameters are determined (sketched in code after the list):
      
      	- readahead index: where the next readahead begins
      	- readahead size:  how much to readahead
      	- lookahead size:  when to do the next readahead (for pipelining)
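      
      Purely illustrative: the three outputs computed for every case above,
      expressed as a small struct (field names are assumptions, not the kernel's
      struct file_ra_state):
      
      struct ra_decision {
              pgoff_t start;            /* readahead index: where the next readahead begins */
              unsigned long size;       /* readahead size: how many pages to read */
              unsigned long async_size; /* lookahead size: when to start the next readahead */
      };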
      
      BEHAVIORS
      
      The old behaviors are maximally preserved for trivial sequential/random reads.
      Notable changes are:
      
      	- It no longer imposes strict sequential checks.
      	  It might help some interleaved cases, and clustered random reads.
      	  It does introduce risks of a random lookahead hit triggering an
      	  unexpected readahead. But in general it is more likely to do good
      	  than to do evil.
      
      	- Interleaved reads are supported in a minimal way.
      	  Their chances of being detected and properly handled are still low.
      
      	- Readahead thrashing is handled better.
      	  The current readahead leads to tiny average I/O sizes, because it
      	  never turns back for the thrashed pages.  They have to be faulted in
      	  by do_generic_mapping_read() one by one.  Whereas the on-demand
      	  readahead will redo readahead for them.
      
      OVERHEADS
      
      The new code reduced the overheads of
      
      	- excessively calling the readahead routine on small sized reads
      	  (the current readahead code insists on seeing all requests)
      
      	- doing a lot of pointless page-cache lookups for small cached files
      	  (the current readahead only turns itself off after 256 cache hits;
      	  unfortunately most files are < 1MB, so they never see that chance)
      
      That accounts for speedup of
      	- 0.3% on 1-page sequential reads on sparse file
      	- 1.2% on 1-page cache hot sequential reads
      	- 3.2% on 256-page cache hot sequential reads
      	- 1.3% on cache hot `tar /lib`
      
      However, it does introduce one extra page-cache lookup per cache miss, which
      impacts random reads slightly. That's 1% overheads for 1-page random reads on
      sparse file.
      
      PERFORMANCE
      
      The basic benchmark setup is
      	- 2.6.20 kernel with on-demand readahead
      	- 1MB max readahead size
      	- 2.9GHz Intel Core 2 CPU
      	- 2GB memory
      	- 160G/8M Hitachi SATA II 7200 RPM disk
      
      The benchmarks show that
      	- it maintains the same performance for trivial sequential/random reads
      	- sysbench/OLTP performance on MySQL gains up to 8%
      	- performance on readahead thrashing gains up to 3 times
      
      iozone throughput (KB/s): roughly the same
      ==========================================
      iozone -c -t1 -s 4096m -r 64k
      
      			       2.6.20          on-demand      gain
      first run
      	  "  Initial write "   61437.27        64521.53      +5.0%
      	  "        Rewrite "   47893.02        48335.20      +0.9%
      	  "           Read "   62111.84        62141.49      +0.0%
      	  "        Re-read "   62242.66        62193.17      -0.1%
      	  "   Reverse Read "   50031.46        49989.79      -0.1%
      	  "    Stride read "    8657.61         8652.81      -0.1%
      	  "    Random read "   13914.28        13898.23      -0.1%
      	  " Mixed workload "   19069.27        19033.32      -0.2%
      	  "   Random write "   14849.80        14104.38      -5.0%
      	  "         Pwrite "   62955.30        65701.57      +4.4%
      	  "          Pread "   62209.99        62256.26      +0.1%
      
      second run
      	  "  Initial write "   60810.31        66258.69      +9.0%
      	  "        Rewrite "   49373.89        57833.66     +17.1%
      	  "           Read "   62059.39        62251.28      +0.3%
      	  "        Re-read "   62264.32        62256.82      -0.0%
      	  "   Reverse Read "   49970.96        50565.72      +1.2%
      	  "    Stride read "    8654.81         8638.45      -0.2%
      	  "    Random read "   13901.44        13949.91      +0.3%
      	  " Mixed workload "   19041.32        19092.04      +0.3%
      	  "   Random write "   14019.99        14161.72      +1.0%
      	  "         Pwrite "   64121.67        68224.17      +6.4%
      	  "          Pread "   62225.08        62274.28      +0.1%
      
      In summary, writes are unstable, reads are pretty close on average:
      
      			  access pattern  2.6.20  on-demand   gain
      				   Read  62085.61  62196.38  +0.2%
      				Re-read  62253.49  62224.99  -0.0%
      			   Reverse Read  50001.21  50277.75  +0.6%
      			    Stride read   8656.21   8645.63  -0.1%
      			    Random read  13907.86  13924.07  +0.1%
      	 		 Mixed workload  19055.29  19062.68  +0.0%
      				  Pread  62217.53  62265.27  +0.1%
      
      aio-stress: roughly the same
      ============================
      aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
      aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso
      
      					2.6.20      on-demand  delta
      			sequential	 92.57s      92.54s    -0.0%
      			random		311.87s     312.15s    +0.1%
      
      sysbench fileio: roughly the same
      =================================
      sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
      	 --file-total-size=4G --file-block-size=64K \
      	 --num-threads=001 --max-requests=10000 --max-time=900 run
      
      				threads    2.6.20   on-demand    delta
      		first run
      				      1   59.1974s    59.2262s  +0.0%
      				      2   58.0575s    58.2269s  +0.3%
      				      4   48.0545s    47.1164s  -2.0%
      				      8   41.0684s    41.2229s  +0.4%
      				     16   35.8817s    36.4448s  +1.6%
      				     32   32.6614s    32.8240s  +0.5%
      				     64   23.7601s    24.1481s  +1.6%
      				    128   24.3719s    23.8225s  -2.3%
      				    256   23.2366s    22.0488s  -5.1%
      
      		second run
      				      1   59.6720s    59.5671s  -0.2%
      				      8   41.5158s    41.9541s  +1.1%
      				     64   25.0200s    23.9634s  -4.2%
      				    256   22.5491s    20.9486s  -7.1%
      
      Note that the numbers are not very stable because of the writes.
      The overall performance is close when we sum all seconds up:
      
                      sum all up               495.046s    491.514s   -0.7%
      
      sysbench oltp (trans/sec): up to 8% gain
      ========================================
      sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
      	 --mysql-socket=/var/run/mysqld/mysqld.sock \
      	 --mysql-user=root --mysql-password=readahead \
      	 --num-threads=064 --max-requests=10000 --max-time=900 run
      
      	10000-transactions run
      				threads    2.6.20   on-demand    gain
      				      1     62.81       64.56   +2.8%
      				      2     67.97       70.93   +4.4%
      				      4     81.81       85.87   +5.0%
      				      8     94.60       97.89   +3.5%
      				     16     99.07      104.68   +5.7%
      				     32     95.93      104.28   +8.7%
      				     64     96.48      103.68   +7.5%
      	5000-transactions run
      				      1     48.21       48.65   +0.9%
      				      8     68.60       70.19   +2.3%
      				     64     70.57       74.72   +5.9%
      	2000-transactions run
      				      1     37.57       38.04   +1.3%
      				      2     38.43       38.99   +1.5%
      				      4     45.39       46.45   +2.3%
      				      8     51.64       52.36   +1.4%
      				     16     54.39       55.18   +1.5%
      				     32     52.13       54.49   +4.5%
      				     64     54.13       54.61   +0.9%
      
      Those are interesting results. Some investigations show that
      	- MySQL is accessing the db file non-uniformly: some parts are
      	  hotter than others
      	- It is mostly doing 4-page random reads, and sometimes doing two
      	  reads in a row, the latter of which triggers a 16-page readahead.
      	- The on-demand readahead leaves many lookahead pages (flagged
      	  PG_readahead) there. Many of them will be hit, and trigger
      	  more readahead pages. Which might save more seeks.
      	- Naturally, the readahead windows tend to lie in hot areas,
      	  and the lookahead pages in hot areas are more likely to be hit.
      	- The higher the overall read density, the larger the possible gain.
      
      That also explains the adaptive readahead tricks for clustered random reads.
      
      readahead thrashing: 3 times better
      ===================================
      We boot the kernel with "mem=128m single", and start a new 100KB/s stream
      every second, until reaching 200 streams.
      
      			      max throughput     min avg I/O size
      		2.6.20:            5MB/s               16KB
      		on-demand:        15MB/s              140KB
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Steven Pratt <slpratt@austin.ibm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      122a21d1
    • mm: fault feedback #2 · 83c54070
      Committed by Nick Piggin
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires all
      handle_mm_fault callers to be modified (possibly the modifications should
      go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
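      
      A sketch of the bit-flag style return codes (the names follow the kernel's
      VM_FAULT_* family; the values here are for illustration only):
      
      #define VM_FAULT_OOM            0x0001
      #define VM_FAULT_SIGBUS         0x0002
      #define VM_FAULT_MAJOR          0x0004
      #define VM_FAULT_WRITE          0x0008  /* special case for get_user_pages */
      
      #define VM_FAULT_NOPAGE         0x0100  /* ->fault installed the pte itself */
      #define VM_FAULT_LOCKED         0x0200  /* ->fault returned a locked page */
      
      #define VM_FAULT_ERROR          (VM_FAULT_OOM | VM_FAULT_SIGBUS)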
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • mm: fault feedback #1 · d0217ac0
      Committed by Nick Piggin
      Change the ->fault prototype.  We now return an int, which contains a
      VM_FAULT_xxx code in the low byte and a FAULT_RET_xxx code in the next byte.
      The FAULT_RET_xxx code tells the VM whether a page was found, whether it has
      been locked, and potentially other things.  This is not quite the way Linus
      wanted it yet, but that's changed in the next patch (which requires changes
      to arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
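      
      A sketch of the renamed structure and the new handler prototype described
      above (field layout illustrative, not the exact header):
      
      struct vm_fault {
              unsigned int flags;             /* FAULT_FLAG_xxx */
              pgoff_t pgoff;                  /* logical page offset within the vma */
              void __user *virtual_address;   /* faulting address (discouraged) */
              struct page *page;              /* set by the ->fault handler */
      };
      
      struct vm_operations_struct {
              int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
              /* ... */
      };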
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
    • mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Committed by Nick Piggin
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should not need to know anything about the virtual memory mapping.  The hitch
      here is that the ->nopage handler didn't pass down enough information
      (ie. pgoff).  But it is more logical to pass pgoff rather than have the
      ->nopage function calculate it itself anyway (because that's a similar
      layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54cb8821