1. 28 Apr, 2008 1 commit
  2. 27 Apr, 2008 1 commit
    • s390: KVM preparation: host memory management changes for s390 kvm · 5b7baf05
      Christian Borntraeger committed
      This patch changes the s390 memory management definitions to use the pgste field
      for dirty and reference bit tracking of host and guest code. Usually on s390,
      dirty and referenced are tracked in storage keys, which belong to the physical
      page. This changes with virtualization: The guest and host dirty/reference bits
      are defined to be the logical OR of the values for the mapping and the physical
      page. This patch implements the necessary changes in pgtable.h for s390.
      
      There is a common code change in mm/rmap.c, the call to
      page_test_and_clear_young must be moved. This is a no-op for all
      architectures but s390. page_referenced checks the referenced bits for
      the physical page and for all mappings:
      o The physical page is checked with page_test_and_clear_young.
      o The mappings are checked with ptep_test_and_clear_young and friends.
      
      Without pgstes (the current implementation on Linux s390) the physical page
      check is implemented but the mapping callbacks are no-ops because dirty
      and referenced are not tracked in the s390 page tables. The pgstes introduce
      guest and host dirty and reference bits for s390 in the host mapping. These
      mappings must be checked before page_test_and_clear_young resets the reference
      bit.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
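The ordering constraint in the commit above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: `toy_page`, `toy_page_referenced` and the helpers are hypothetical names, and the rule that a mapping counts as referenced when either its own bit or the physical page's storage-key bit is set is taken from the "logical OR" wording of the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NMAPS 2

/* Hypothetical model of one page: a storage-key reference bit for the
 * physical page plus one pgste-style reference bit per mapping. */
struct toy_page {
    bool key_referenced;
    bool map_referenced[TOY_NMAPS];
};

/* Stand-in for page_test_and_clear_young(): test and reset the key bit. */
static bool toy_test_and_clear_key(struct toy_page *p)
{
    bool was = p->key_referenced;
    p->key_referenced = false;
    return was;
}

/* Stand-in for ptep_test_and_clear_young(): a mapping counts as referenced
 * if its own bit OR the physical page's key bit is set (assumed rule). */
static bool toy_test_and_clear_map(struct toy_page *p, int i)
{
    bool was = p->map_referenced[i] || p->key_referenced;
    p->map_referenced[i] = false;
    return was;
}

/* Count references; key_first selects the old ordering, in which the
 * key bit is reset before the mappings are examined. */
static int toy_page_referenced(struct toy_page *p, bool key_first)
{
    int referenced = 0;

    if (key_first && toy_test_and_clear_key(p))
        referenced++;
    for (int i = 0; i < TOY_NMAPS; i++)
        if (toy_test_and_clear_map(p, i))
            referenced++;
    if (!key_first && toy_test_and_clear_key(p))
        referenced++;
    return referenced;
}
```

In this model, a page whose only reference is recorded in the storage key is seen by every mapping check when the mappings are examined first, but only once when the key bit is reset first.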
  3. 20 Mar, 2008 1 commit
  4. 05 Mar, 2008 1 commit
  5. 10 Feb, 2008 1 commit
  6. 08 Feb, 2008 2 commits
    • Memory controller: make page_referenced() cgroup aware · bed7161a
      Balbir Singh committed
      Make page_referenced() cgroup aware.  Without this patch, page_referenced()
      can cause a page to be skipped while reclaiming pages.  This patch ensures
      that other cgroups do not hold pages in a particular cgroup hostage.  It
      is required to ensure that shared pages are freed from a cgroup when they
      are not actively referenced from the cgroup that brought them in.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
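A minimal sketch of what "cgroup aware" could mean for the reference check, assuming each mapping can be attributed to one cgroup. The types and names (`toy_mapping`, `TOY_GLOBAL_RECLAIM`, integer cgroup ids) are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping record: which cgroup the mapping task belongs to
 * and whether its pte was recently accessed. */
struct toy_mapping {
    int cgroup_id;
    int young;
};

#define TOY_GLOBAL_RECLAIM (-1)

static int toy_page_referenced(const struct toy_mapping *maps, size_t n,
                               int reclaim_cgroup)
{
    int referenced = 0;

    for (size_t i = 0; i < n; i++) {
        /* During cgroup reclaim, ignore references from other cgroups. */
        if (reclaim_cgroup != TOY_GLOBAL_RECLAIM &&
            maps[i].cgroup_id != reclaim_cgroup)
            continue;
        if (maps[i].young)
            referenced++;
    }
    return referenced;
}
```

With this filter, a page that only another cgroup is touching no longer looks referenced when the cgroup that brought it in is being reclaimed, so it can be freed from that cgroup.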
    • Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh committed
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      RSS is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache() and
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer; this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code had help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
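The idea of protecting a pointer with its own last bit can be sketched as follows. This is an illustration of the general low-bit tagging technique, assuming the pointed-to structure is aligned to at least two bytes; the names are hypothetical, and the sketch is not atomic, whereas a real lock would need atomic bit operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-page accounting structure. */
struct toy_page_cgroup {
    int ref_cnt;
};

/* Alignment leaves the lowest pointer bit free to act as a lock flag. */
#define TOY_LOCK_BIT ((uintptr_t)1)

/* Combine a pointer and a lock flag into one word. */
static uintptr_t toy_pack(struct toy_page_cgroup *pc, int locked)
{
    return (uintptr_t)pc | (locked ? TOY_LOCK_BIT : 0);
}

/* Recover the pointer by masking the lock bit off. */
static struct toy_page_cgroup *toy_pointer(uintptr_t word)
{
    return (struct toy_page_cgroup *)(word & ~TOY_LOCK_BIT);
}

static int toy_is_locked(uintptr_t word)
{
    return (word & TOY_LOCK_BIT) != 0;
}
```

Keeping the lock and the pointer in one word means both can be examined or changed together, which is what makes races between simultaneous mappings of the same page easier to reason about.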
  7. 06 Feb, 2008 2 commits
    • mm: don't waste swap on locked pages · 5a9bbdcd
      Hugh Dickins committed
      try_to_unmap always fails on a page found in a VM_LOCKED vma (unless
      migrating), and recycles it back to the active list.  But if it's an
      anonymous page, we've already allocated swap to it: just wasting swap.
      Spot locked pages in page_referenced_one and treat them as referenced.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ethan Solomita <solo@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
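The check described in this commit can be sketched in a few lines. The flag value and the `toy_` names are illustrative assumptions; only the shape of the test follows the commit message.

```c
#include <assert.h>

#define TOY_VM_LOCKED 0x2000UL   /* illustrative flag value */

struct toy_vma {
    unsigned long vm_flags;
};

/* Returns 1 if the page should be treated as referenced via this mapping. */
static int toy_page_referenced_one(const struct toy_vma *vma, int pte_young)
{
    /* A page in a locked vma cannot be unmapped by reclaim, so report it
     * as referenced before any swap would be allocated for it. */
    if (vma->vm_flags & TOY_VM_LOCKED)
        return 1;
    return pte_young ? 1 : 0;
}
```

Reporting the page as referenced sends it back to the active list early, before the reclaim path reaches the point where swap space is assigned.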
    • radix-tree: avoid atomic allocations for preloaded insertions · e2848a0e
      Nick Piggin committed
      Most pagecache (and some other) radix tree insertions have the great
      opportunity to preallocate a few nodes with relaxed gfp flags.  But the
      preallocation is squandered when it comes time to allocate a node: we
      default to first attempting a GFP_ATOMIC allocation.  That doesn't
      normally fail, but it can eat into atomic memory reserves that we don't
      need to be using.
      
      Another upshot of this is that it removes the sometimes highly contended
      zone->lock from underneath tree_lock.  Pagecache insertions are always
      performed with a radix tree preload, and after this change, such a
      situation will never fall back to kmem_cache_alloc within
      radix_tree_node_alloc.
      
      David Miller reports seeing this allocation fail on a highly threaded
      sparc64 system:
      
      [527319.459981] dd: page allocation failure. order:0, mode:0x20
      [527319.460403] Call Trace:
      [527319.460568]  [00000000004b71e0] __slab_alloc+0x1b0/0x6a8
      [527319.460636]  [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8
      [527319.460698]  [000000000055309c] radix_tree_node_alloc+0x20/0x90
      [527319.460763]  [0000000000553238] radix_tree_insert+0x12c/0x260
      [527319.460830]  [0000000000495cd0] add_to_page_cache+0x38/0xb0
      [527319.460893]  [00000000004e4794] mpage_readpages+0x6c/0x134
      [527319.460955]  [000000000049c7fc] __do_page_cache_readahead+0x170/0x280
      [527319.461028]  [000000000049cc88] ondemand_readahead+0x208/0x214
      [527319.461094]  [0000000000496018] do_generic_mapping_read+0xe8/0x428
      [527319.461152]  [0000000000497948] generic_file_aio_read+0x108/0x170
      [527319.461217]  [00000000004badac] do_sync_read+0x88/0xd0
      [527319.461292]  [00000000004bb5cc] vfs_read+0x78/0x10c
      [527319.461361]  [00000000004bb920] sys_read+0x34/0x60
      [527319.461424]  [0000000000406294] linux_sparc_syscall32+0x3c/0x40
      
      The calltrace is significant: __do_page_cache_readahead allocates a number
      of pages with GFP_KERNEL, and hence it should have reclaimed sufficient
      memory to satisfy GFP_ATOMIC allocations.  However after the list of pages
      goes to mpage_readpages, there can be significant intervals (including disk
      IO) before all the pages are inserted into the radix-tree.  So the reserves
      can easily be depleted at that point.  The patch is confirmed to fix the
      problem.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
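A toy model of the allocation order after this change, assuming a small pool of preallocated nodes. The structure and names are hypothetical; the point is only that a caller that cannot sleep consumes the preloaded nodes first and reaches the general allocator only when the pool is empty.

```c
#include <assert.h>

enum toy_source { TOY_FROM_PRELOAD, TOY_FROM_ALLOCATOR };

struct toy_preload {
    int nr;            /* nodes preallocated earlier with relaxed gfp flags */
    int slab_allocs;   /* how often the general allocator was used */
};

/* Allocate one node and report where it came from. */
static enum toy_source toy_node_alloc(struct toy_preload *rtp, int can_sleep)
{
    /* Atomic context: use the preloaded nodes before anything else. */
    if (!can_sleep && rtp->nr > 0) {
        rtp->nr--;
        return TOY_FROM_PRELOAD;
    }
    /* Otherwise fall back to the general allocator. */
    rtp->slab_allocs++;
    return TOY_FROM_ALLOCATOR;
}
```

As long as the preload step reserved enough nodes for one insertion, the fallback branch is never taken in atomic context, so the atomic reserves stay untouched.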
  8. 20 Nov, 2007 1 commit
    • [S390] Optimize storage key handling for anonymous pages · ce7e9fae
      Christian Borntraeger committed
      page_mkclean used to call page_clear_dirty for every given page. This
      is different from all other architectures, where the dirty bit in the
      PTEs is only reset if page_mapping() returns a non-NULL pointer.
      We can move the page_test_dirty/page_clear_dirty sequence into the
      2nd if to avoid unnecessary iske/sske sequences, which are expensive.
      
      This change also helps kvm for s390 as the host must transfer the
      dirty bit into the guest status bits. By moving the page_clear_dirty
      operation into the 2nd if, the vm will only call page_clear_dirty
      for pages where it walks the mapping anyway. There it calls
      ptep_clear_flush for writable ptes, so we can transfer the dirty bit
      to the guest.
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
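The control flow described above, with the storage key operations moved into the second `if`, might look like the following toy model. The `toy_page` fields and the `key_ops` counter are assumptions made for illustration; they stand in for page_mapped(), page_mapping() and the iske/sske instructions.

```c
#include <assert.h>

struct toy_page {
    int mapped;       /* stands in for page_mapped() */
    int has_mapping;  /* stands in for page_mapping() != NULL */
    int key_dirty;    /* dirty bit in the storage key */
    int key_ops;      /* number of storage key instructions issued */
};

static int toy_page_mkclean(struct toy_page *page)
{
    int ret = 0;

    if (page->mapped) {
        if (page->has_mapping) {
            /* (walking and write-protecting the ptes is omitted) */
            page->key_ops++;              /* page_test_dirty: iske */
            if (page->key_dirty) {
                page->key_ops++;          /* page_clear_dirty: sske */
                page->key_dirty = 0;
                ret = 1;
            }
        }
    }
    return ret;
}
```

For an anonymous page (no mapping) the function now issues no storage key instruction at all, which is the saving the commit message describes.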
  9. 15 Nov, 2007 1 commit
    • Migration: find correct vma in new_vma_page() · 3ad33b24
      Lee Schermerhorn committed
      We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
      mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas.  For
      anon-regions, we just fail to migrate any pages beyond the 1st vma in the
      range.
      
      This occurs because do_mbind() collects a list of pages to migrate by
      calling check_range().  check_range() walks the task's mm, spanning vmas as
      necessary, to collect the migratable pages into a list.  Then, do_mbind()
      calls migrate_pages() passing the list of pages, a function to allocate new
      pages based on vma policy [new_vma_page()], and a pointer to the first vma
      of the range.
      
      For each page in the list, new_vma_page() calls page_address_in_vma()
      passing the page and the vma [first in range] to obtain the address to get
      for alloc_page_vma().  The page address is needed to get interleaving
      policy correct.  If the pages in the list come from multiple vmas,
      eventually, new_page_address() will pass that page to page_address_in_vma()
      with the incorrect vma.  For !PageAnon pages, this will result in a bug
      check in rmap.c:vma_address().  For anon pages, vma_address() will just
      return EFAULT and fail the migration.
      
      This patch modifies new_vma_page() to check the return value from
      page_address_in_vma().  If the return value is EFAULT, new_vma_page()
      searches forward via vm_next for the vma that maps the page--i.e., that does
      not return EFAULT.  This assumes that the pages in the list handed to
      migrate_pages() are in address order.  This is currently the case.  The patch
      documents this assumption in a new comment block for new_vma_page().
      
      If new_vma_page() cannot locate the vma mapping the page in a forward
      search in the mm, it will pass a NULL vma to alloc_page_vma().  This will
      result in the allocation using the task policy, if any, else system default
      policy.  This situation is unlikely, but the patch documents this behavior
      with a comment.
      
      Note, this patch results in restarting from the first vma in a multi-vma
      range each time new_vma_page() is called.  If this is not acceptable, we
      can make the vma argument a pointer, both in new_vma_page() and its caller
      unmap_and_move() so that the value held by the loop in migrate_pages()
      always passes down the last vma in which a page was found.  This will
      require changes to all new_page_t functions passed to migrate_pages().  Is
      this necessary?
      
      For this patch to work, we can't bug check in vma_address() for pages
      outside the argument vma.  This patch removes the BUG_ON().  All other
      callers [besides new_vma_page()] already check the return status.
      
      Tested on x86_64, 4 node NUMA platform.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
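The forward search described in this commit can be sketched with a simplified vma list. This is a hedged illustration: the real lookup takes a page rather than an address, and `toy_vma`, `toy_address_in_vma` and `TOY_EFAULT` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EFAULT ((unsigned long)-14)

struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;      /* exclusive */
    struct toy_vma *vm_next;
};

/* Simplified stand-in for page_address_in_vma(): returns the address if
 * the vma covers it, TOY_EFAULT otherwise. */
static unsigned long toy_address_in_vma(unsigned long addr,
                                        const struct toy_vma *vma)
{
    if (addr < vma->vm_start || addr >= vma->vm_end)
        return TOY_EFAULT;
    return addr;
}

/* Walk forward from the first vma of the range until one maps addr.
 * Returns NULL if none does; the caller then uses the default policy. */
static struct toy_vma *toy_find_vma_forward(struct toy_vma *vma,
                                            unsigned long addr)
{
    while (vma && toy_address_in_vma(addr, vma) == TOY_EFAULT)
        vma = vma->vm_next;
    return vma;
}
```

Because the search restarts from the first vma on every call, its cost grows with the number of vmas in the range, which is the trade-off the commit message raises in its closing question.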
  10. 17 Oct, 2007 3 commits
  11. 20 Jul, 2007 2 commits
    • mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt committed
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
    • mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Nick Piggin committed
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should not need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (i.e. pgoff).
      But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
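For a linear mapping, the pgoff that the commit proposes to pass down can be derived from the faulting address and the vma alone. The sketch below shows that arithmetic; the 4 KiB page size and the `toy_` names are assumptions for illustration.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KiB pages */

struct toy_vma {
    unsigned long vm_start;   /* first virtual address of the mapping */
    unsigned long vm_pgoff;   /* file offset of vm_start, in pages */
};

/* File page offset backing a virtual address in a linear mapping. */
static unsigned long toy_linear_pgoff(const struct toy_vma *vma,
                                      unsigned long address)
{
    return ((address - vma->vm_start) >> TOY_PAGE_SHIFT) + vma->vm_pgoff;
}
```

A nonlinear mapping encodes the offset per pte instead, so computing pgoff in the VM layer and handing it to the handler keeps the filesystem code unaware of which kind of mapping it serves.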
  12. 29 Jun, 2007 1 commit
    • mm: kill validate_anon_vma to avoid mapcount BUG · 30acbaba
      Hugh Dickins committed
      validate_anon_vma gave a useful check on the integrity of the anon_vma list
      when Andrea was developing obj rmap; but it was not enabled in SLES9
      itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to
      configurable CONFIG_DEBUG_VM in 2.6.17.  Now Petr Vandrovec reports that
      its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system.
      
      That limit was just an arbitrary number to protect against an infinite
      loop.  We could raise it to something enormous (depending on sizeof struct
      vma and size of memory?); but I rather think validate_anon_vma has outlived
      its usefulness, and is better just removed - which gives a magnificent
      performance boost to anything like Petr's test program ;)
      
      Of course, a very long anon_vma list is bad news for preemption latency,
      and I believe there has been one recent report of such: let's not forget
      that, but validate_anon_vma only makes it worse not better.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Petr Vandrovec <petr@vmware.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 17 May, 2007 2 commits
    • mm: more rmap checking · c97a9e10
      Nick Piggin committed
      Re-introduce rmap verification patches that Hugh removed when he removed
      PG_map_lock. PG_map_lock actually isn't needed to synchronise access to
      anonymous pages, because PG_locked and PTL together already do.
      
      These checks were important in discovering and fixing a rare rmap corruption
      in SLES9.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove SLAB_CTOR_CONSTRUCTOR · a35afb83
      Christoph Lameter committed
      SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Steven French <sfrench@us.ibm.com>
      Cc: Michael Halcrow <mhalcrow@us.ibm.com>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dave Kleikamp <shaggy@austin.ibm.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Anton Altaparmakov <aia21@cantab.net>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@ucw.cz>
      Cc: David Chinner <dgc@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 09 May, 2007 1 commit
  15. 08 May, 2007 1 commit
    • slab allocators: Remove SLAB_DEBUG_INITIAL flag · 50953fe9
      Christoph Lameter committed
      I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
      SLAB.
      
      I think its purpose was to have a callback after an object has been freed
      to verify that the state is the constructor state again?  The callback is
      performed before each freeing of an object.
      
      I would think that it is much easier to check the object state manually
      before the free.  That also places the check near the code that
      manipulates the object.
      
      Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
      compiled with SLAB debugging on.  If there would be code in a constructor
      handling SLAB_DEBUG_INITIAL then it would have to be conditional on
      SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
      in the kernel.  I think SLAB_DEBUG_INITIAL is too problematic to make real
      use of, difficult to understand and there are easier ways to accomplish the
      same effect (i.e.  add debug code before kfree).
      
      There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
      clear in fs inode caches.  Remove the pointless checks (they would even be
      pointless without removal of SLAB_DEBUG_INITIAL) from the fs constructors.
      
      This is the last slab flag that SLUB did not support.  Remove the check for
      unimplemented flags from SLUB.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 27 Apr, 2007 1 commit
    • [S390] split page_test_and_clear_dirty. · 6c210482
      Martin Schwidefsky committed
      The page_test_and_clear_dirty primitive really consists of two
      operations, page_test_dirty and page_clear_dirty. The combination
      of the two is not an atomic operation, so it makes more sense to have
      two separate operations instead of one.
      In addition to the improved readability of the s390 version of
      SetPageUptodate, it now avoids the page_test_dirty operation, which is
      an insert-storage-key-extended (iske) instruction and therefore
      expensive.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  17. 04 Apr, 2007 1 commit
    • [S390] page_mkclean data corruption. · 6e1beb3c
      Martin Schwidefsky committed
      The git commit c2fda5fe which
      added the page_test_and_clear_dirty call to page_mkclean and the
      git commit 7658cc28 which fixes
      the "nasty and subtle race in shared mmap'ed page writeback"
      problem in clear_page_dirty_for_io cause data corruption on s390.
      
      The effect of the two changes is that for every call to
      clear_page_dirty_for_io a page_test_and_clear_dirty is done. If
      the per page dirty bit is set, set_page_dirty is called. Strangely
      clear_page_dirty_for_io is called for not-uptodate pages, e.g.
      over this call-chain:
      
       [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130
       [<000000000007c494>] generic_writepages+0x258/0x3e0
       [<000000000007c692>] do_writepages+0x76/0x7c
       [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4
       [<00000000000c831a>] sync_sb_inodes+0x23e/0x398
       [<00000000000c8802>] writeback_inodes+0x12e/0x140
       [<000000000007b9ee>] wb_kupdate+0xd2/0x178
       [<000000000007cca2>] pdflush+0x162/0x23c
      
      The bad news now is that page_test_and_clear_dirty might claim
      that a not-uptodate page is dirty since SetPageUptodate which
      resets the per page dirty bit has not yet been called. The page
      writeback that follows clobbers the data on disk.
      
      The simplest solution to this problem is to move the call to
      page_test_and_clear_dirty under the "if (page_mapped(page))".
      If a file backed page is mapped it is uptodate.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
  18. 02 Mar, 2007 1 commit
  19. 31 Dec, 2006 1 commit
  20. 23 Dec, 2006 2 commits
  21. 21 Oct, 2006 1 commit
  22. 12 Oct, 2006 1 commit
  23. 26 Sep, 2006 1 commit
    • [PATCH] mm: tracking shared dirty pages · d08b3851
      Peter Zijlstra committed
      Tracking of dirty pages in shared writeable mmap()s.
      
      The idea is simple: write protect clean shared writeable pages, catch the
      write-fault, make writeable and set dirty.  On page write-back clean all the
      PTE dirty bits and write protect them once again.
      
      The implementation is a tad harder, mainly because the default
      backing_dev_info capabilities were too loosely maintained.  Hence it is not
      enough to test the backing_dev_info for cap_account_dirty.
      
      The current heuristic is as follows, a VMA is eligible when:
       - it's shared writeable
          (vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)
       - it is not a 'special' mapping
          (vm_flags & (VM_PFNMAP|VM_INSERTPAGE)) == 0
       - the backing_dev_info is cap_account_dirty
          mapping_cap_account_dirty(vma->vm_file->f_mapping)
       - f_op->mmap() didn't change the default page protection
      
      Pages from remap_pfn_range() are explicitly excluded because their COW
      semantics are already horrid enough (see vm_normal_page() in do_wp_page()) and
      because they don't have a backing store anyway.
      
      mprotect() is taught about the new behaviour as well.  However it overrides
      the last condition.
      
      Cleaning the pages on write-back is done with page_mkclean(), a new rmap call.
      It can be called on any page, but is currently only implemented for mapped
      pages; if the page is found to be part of a VMA that accounts dirty pages, it will
      also wrprotect the PTE.
      
      Finally, in fs/buffer.c:try_to_free_buffers(), remove clear_page_dirty() from
      under ->private_lock.  This seems to be safe, since ->private_lock is used to
      serialize access to the buffers, not the page itself.  This is needed because
      clear_page_dirty() will call into page_mkclean() and would thereby violate
      locking order.
      
      [dhowells@redhat.com: Provide a page_mkclean() implementation for NOMMU]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
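The eligibility heuristic listed in this commit message can be written as a single predicate. This is a sketch: the flag values are illustrative, the function name is hypothetical, and the last two conditions (backing device capability, unchanged default protection) are passed in as booleans rather than derived from real kernel structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values are illustrative; only the bit tests matter here. */
#define TOY_VM_WRITE      0x00000002UL
#define TOY_VM_SHARED     0x00000008UL
#define TOY_VM_PFNMAP     0x00000400UL
#define TOY_VM_INSERTPAGE 0x02000000UL

/* True if a VMA should have its clean pages write-protected so that the
 * first write faults and the page can be marked dirty. */
static bool toy_vma_tracks_dirty(unsigned long vm_flags,
                                 bool cap_account_dirty,
                                 bool default_prot_kept)
{
    /* Must be both shared and writeable. */
    if ((vm_flags & (TOY_VM_WRITE | TOY_VM_SHARED)) !=
        (TOY_VM_WRITE | TOY_VM_SHARED))
        return false;
    /* Must not be a 'special' mapping. */
    if ((vm_flags & (TOY_VM_PFNMAP | TOY_VM_INSERTPAGE)) != 0)
        return false;
    /* The backing device must account dirty pages. */
    if (!cap_account_dirty)
        return false;
    /* f_op->mmap() must not have changed the default page protection. */
    return default_prot_kept;
}
```

According to the commit message, mprotect() applies the same test but overrides the last condition.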
  24. 01 Jul, 2006 2 commits
    • [PATCH] zoned vm counters: split NR_ANON_PAGES off from NR_FILE_MAPPED · f3dbd344
      Christoph Lameter committed
      The current NR_FILE_MAPPED is used by zone reclaim and the dirty load
      calculation as the number of mapped pagecache pages.  However, that is not
      true.  NR_FILE_MAPPED includes the mapped anonymous pages.  This patch
      separates those and therefore allows an accurate tracking of the anonymous
      pages per zone.
      
      It then becomes possible to determine the number of unmapped pages per zone
      and we can avoid scanning for unmapped pages if there are none.
      
      Also it may now be possible to determine the mapped/unmapped ratio in
      get_dirty_limit.  Isn't the number of anonymous pages irrelevant in that
      calculation?
      
      Note that this will change the meaning of the number of mapped pages reported
      in /proc/vmstat /proc/meminfo and in the per node statistics.  This may affect
      user space tools that monitor these counters!  NR_FILE_MAPPED works like
      NR_FILE_DIRTY.  It is only valid for pagecache pages.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] zoned vm counters: convert nr_mapped to per zone counter · 65ba55f5
      Christoph Lameter committed
      nr_mapped is important because it allows a determination of how many pages of
      a zone are not mapped, which would allow a more efficient means of determining
      when we need to reclaim memory in a zone.
      
      We take the nr_mapped field out of the page state structure and define a new
      per zone counter named NR_FILE_MAPPED (the anonymous pages will be split off
      from NR_MAPPED in the next patch).
      
      We replace the use of nr_mapped in various kernel locations.  This avoids the
      looping over all processors in try_to_free_pages(), writeback, reclaim (swap +
      zone reclaim).
      
      [akpm@osdl.org: bugfix]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  25. 26 Jun, 2006 1 commit
  26. 23 Jun, 2006 5 commits
    • [PATCH] More page migration: use migration entries for file pages · 04e62a29
      Christoph Lameter committed
      This implements the use of migration entries to preserve ptes of file backed
      pages during migration.  Processes can therefore be migrated back and forth
      without losing their connection to pagecache pages.
      
      Note that we implement the migration entries only for linear mappings.
      Nonlinear mappings still require the unmapping of the ptes for migration.
      
      And another writepage() ugliness shows up.  writepage() can drop the page
      lock.  Therefore we have to remove migration ptes before calling writepages()
      in order to avoid having migration entries point to unlocked pages.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] More page migration: do not inc/dec rss counters · 442c9137
      Christoph Lameter committed
      If we install a migration entry then the rss does not really decrease, since the
      page is just moved somewhere else.  We can save ourselves the work of
      decrementing and later incrementing, which would just cause cacheline
      bouncing.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Swapless page migration: rip out swap based logic · d75a0fcd
      Christoph Lameter committed
      Rip the page migration logic out.
      
      Remove all code that has to do with swapping during page migration.
      
      This also guts the ability to migrate pages to swap.  No one used that, so let's
      let it go for good.
      
      Page migration should be a bit broken after this patch.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • [PATCH] Swapless page migration: add R/W migration entries · 0697212a
      Christoph Lameter committed
      Implement read/write migration ptes
      
      We take the upper two swapfiles for the two types of migration ptes and define
      a series of macros in swapops.h.
      
      The VM is modified to handle the migration entries.  migration entries can
      only be encountered when the page they are pointing to is locked.  This limits
      the number of places one has to fix.  We also check in copy_pte_range and in
      mprotect_pte_range() for migration ptes.
      
      We check for migration ptes in do_swap_page and call a function that will
      then wait on the page lock.  This allows us to effectively stop all accesses
      to a page.
      
      Migration entries are created by try_to_unmap if called for migration and
      removed by local functions in migrate.c
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration (I've no NUMA, just
        hacking it up to migrate recklessly while running load), I've hit the
        BUG_ON(!PageLocked(p)) in migration_entry_to_page.
      
        This comes from an orphaned migration entry, unrelated to the current
        correctly locked migration, but hit by remove_anon_migration_ptes as it
        checks an address in each vma of the anon_vma list.
      
        Such an orphan may be left behind if an earlier migration raced with fork:
        copy_one_pte can duplicate a migration entry from parent to child, after
        remove_anon_migration_ptes has checked the child vma, but before it has
        removed it from the parent vma.  (If the process were later to fault on this
        orphaned entry, it would hit the same BUG from migration_entry_wait.)
      
        This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
        not.  There's no such problem with file pages, because vma_prio_tree_add
        adds child vma after parent vma, and the page table locking at each end is
        enough to serialize.  Follow that example with anon_vma: add new vmas to the
        tail instead of the head.
      
        (There's no corresponding problem when inserting migration entries,
        because a missed pte will leave the page count and mapcount high, which is
        allowed for.  And there's no corresponding problem when migrating via swap,
        because a leftover swap entry will be correctly faulted.  But the swapless
        method has no refcounting of its entries.)
      
      From: Ingo Molnar <mingo@elte.hu>
      
        pte_unmap_unlock() takes the pte pointer as an argument.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Several times while testing swapless page migration, gcc has tried to exec
        a pointer instead of a string: smells like COW mappings are not being
        properly write-protected on fork.
      
        The protection in copy_one_pte looks very convincing, until at last you
        realize that the second arg to make_migration_entry is a boolean "write",
        and SWP_MIGRATION_READ is 30.
      
        Anyway, it's better done like in change_pte_range, using
        is_write_migration_entry and make_migration_entry_read.
      
      From: Hugh Dickins <hugh@veritas.com>
      
        Remove unnecessary obfuscation from sys_swapon's range check on swap type,
        which blew up causing memory corruption once swapless migration made
        MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      From: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
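Two of the pitfalls mentioned in this commit message, the boolean "write" argument and the swap type range check, can be reproduced with a small sketch. It assumes MAX_SWAPFILES_SHIFT is 5, which is consistent with the message's statement that SWP_MIGRATION_READ is 30; all `TOY_` and `toy_` names are hypothetical.

```c
#include <assert.h>

#define TOY_MAX_SWAPFILES_SHIFT 5   /* assumed value */
/* The upper two swap types are reserved for migration entries. */
#define TOY_MAX_SWAPFILES       ((1u << TOY_MAX_SWAPFILES_SHIFT) - 2u)
#define TOY_SWP_MIGRATION_READ  TOY_MAX_SWAPFILES
#define TOY_SWP_MIGRATION_WRITE (TOY_MAX_SWAPFILES + 1u)

/* The argument is a boolean "write" flag, not a swap type. */
static unsigned int toy_migration_type(int write)
{
    return write ? TOY_SWP_MIGRATION_WRITE : TOY_SWP_MIGRATION_READ;
}

/* Range check for a real swap device type: compare against the count
 * directly, since it is no longer a power of two. */
static int toy_swap_type_valid(unsigned int type)
{
    return type < TOY_MAX_SWAPFILES;
}
```

Passing `TOY_SWP_MIGRATION_READ` where the boolean is expected yields a write entry, because 30 is nonzero; that is the mistake the message describes in copy_one_pte.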
    • [PATCH] page migration cleanup: rename "ignrefs" to "migration" · 7352349a
      Christoph Lameter committed
      migrate is a better name since it is only used by page migration.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  27. 22 Mar, 2006 2 commits