1. 20 10月, 2008 2 次提交
    • L
      SHM_LOCKED pages are unevictable · 89e004ea
      Lee Schermerhorn 提交于
      Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
      kept on the normal LRU, since scanning them is a waste of time and might
      throw off kswapd's balancing algorithms.  Place them on the unevictable
      LRU list instead.
      
      Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
      memory regions as unevictable.  Then these pages will be culled off the
      normal LRU lists during vmscan.
      
      Add new wrapper function to clear the mapping's unevictable state when/if
      shared memory segment is munlocked.
      
      Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
      the shmem segment's mapping [struct address_space] for evictability now
      that they're no longer locked.  If so, move them to the appropriate zone
      lru list.
      
      Changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [kosaki.motohiro@jp.fujitsu.com: revert shm change]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89e004ea
    • L
      Ramfs and Ram Disk pages are unevictable · ba9ddf49
      Lee Schermerhorn 提交于
      Christoph Lameter pointed out that ram disk pages also clutter the LRU
      lists.  When vmscan finds them dirty and tries to clean them, the ram disk
      writeback function just redirties the page so that it goes back onto the
      active list.  Round and round she goes...
      
      With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
      longer the case, as ram disk pages are no longer maintained on the lru.
      [This makes them unmigratable for defrag or memory hot remove, but that
      can be addressed by a separate patch series.] However, the ramfs pages
      behave like ram disk pages used to, so:
      
      Define new address_space flag [shares address_space flags member with
      mapping's gfp mask] to indicate that the address space contains all
      unevictable pages.  This will provide for efficient testing of ramfs pages
      in page_evictable().
      
      Also provide wrapper functions to set/test the unevictable state to
      minimize #ifdefs in ramfs driver and any other users of this facility.
      
      Set the unevictable state on address_space structures for new ramfs
      inodes.  Test the unevictable state in page_evictable() to cull
      unevictable pages.
      
      These changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [riel@redhat.com: undo the brd.c part]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Debugged-by: NNick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba9ddf49
  2. 05 8月, 2008 1 次提交
  3. 30 7月, 2008 1 次提交
  4. 29 7月, 2008 1 次提交
    • A
      mmu-notifiers: add mm_take_all_locks() operation · 7906d00c
      Andrea Arcangeli 提交于
      mm_take_all_locks holds off reclaim from an entire mm_struct.  This allows
      mmu notifiers to register into the mm at any time with the guarantee that
      no mmu operation is in progress on the mm.
      
      This operation locks against the VM for all pte/vma/mm related operations
      that could ever happen on a certain mm.  This includes vmtruncate,
      try_to_unmap, and all page faults.
      
      The caller must take the mmap_sem in write mode before calling
      mm_take_all_locks().  The caller isn't allowed to release the mmap_sem
      until mm_drop_all_locks() returns.
      
      mmap_sem in write mode is required in order to block all operations that
      could modify pagetables and free pages without need of altering the vma
      layout (for example populate_range() with nonlinear vmas).  It's also
      needed in write mode to avoid new anon_vmas to be associated with existing
      vmas.
      
      A single task can't take more than one mm_take_all_locks() in a row or it
      would deadlock.
      
      mm_take_all_locks() and mm_drop_all_locks are expensive operations that
      may have to take thousand of locks.
      
      mm_take_all_locks() can fail if it's interrupted by signals.
      
      When mmu_notifier_register returns, we must be sure that the driver is
      notified if some task is in the middle of a vmtruncate for the 'mm' where
      the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
      is run around the vmtruncation but mmu_notifier_register can run after
      mmu_notifier_invalidate_range_start and before
      mmu_notifier_invalidate_range_end).  Same problem for rmap paths.  And
      we've to remove page pinning to avoid replicating the tlb_gather logic
      inside KVM (and GRU doesn't work well with page pinning regardless of
      needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
      the page, kvm would have no way to notice that it mapped into sptes a page
      that is going into the freelist without a chance of any further
      mmu_notifier notification.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7906d00c
  5. 27 7月, 2008 1 次提交
    • N
      mm: speculative page references · e286781d
      Nick Piggin 提交于
      If we can be sure that elevating the page_count on a pagecache page will
      pin it, we can speculatively run this operation, and subsequently check to
      see if we hit the right page rather than relying on holding a lock or
      otherwise pinning a reference to the page.
      
      This can be done if get_page/put_page behaves consistently throughout the
      whole tree (ie.  if we "get" the page after it has been used for something
      else, we must be able to free it with a put_page).
      
      Actually, there is a period where the count behaves differently: when the
      page is free or if it is a constituent page of a compound page.  We need
      an atomic_inc_not_zero operation to ensure we don't try to grab the page
      in either case.
      
      This patch introduces the core locking protocol to the pagecache (ie.
      adds page_cache_get_speculative, and tweaks some update-side code to make
      it work).
      
      Thanks to Hugh for pointing out an improvement to the algorithm setting
      page_count to zero when we have control of all references, in order to
      hold off speculative getters.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
      [hugh@veritas.com: fix add_to_page_cache]
      [akpm@linux-foundation.org: repair a comment]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e286781d
  6. 25 7月, 2008 1 次提交
  7. 14 2月, 2008 1 次提交
  8. 07 12月, 2007 1 次提交
  9. 17 10月, 2007 3 次提交
    • N
      fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Nick Piggin 提交于
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NDmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afddba49
    • N
      mm: fix pagecache write deadlocks · 08291429
      Nick Piggin 提交于
      Modify the core write() code so that it won't take a pagefault while holding a
      lock on the pagecache page. There are a number of different deadlocks possible
      if we try to do such a thing:
      
      1.  generic_buffered_write
      2.   lock_page
      3.    prepare_write
      4.     unlock_page+vmtruncate
      5.     copy_from_user
      6.      mmap_sem(r)
      7.       handle_mm_fault
      8.        lock_page (filemap_nopage)
      9.    commit_write
      10.  unlock_page
      
      a. sys_munmap / sys_mlock / others
      b.  mmap_sem(w)
      c.   make_pages_present
      d.    get_user_pages
      e.     handle_mm_fault
      f.      lock_page (filemap_nopage)
      
      2,8	- recursive deadlock if page is same
      2,8;2,8	- ABBA deadlock is page is different
      2,6;b,f	- ABBA deadlock if page is same
      
      The solution is as follows:
      1.  If we find the destination page is uptodate, continue as normal, but use
          atomic usercopies which do not take pagefaults and do not zero the uncopied
          tail of the destination. The destination is already uptodate, so we can
          commit_write the full length even if there was a partial copy: it does not
          matter that the tail was not modified, because if it is dirtied and written
          back to disk it will not cause any problems (uptodate *means* that the
          destination page is as new or newer than the copy on disk).
      
      1a. The above requires that fault_in_pages_readable correctly returns access
          information, because atomic usercopies cannot distinguish between
          non-present pages in a readable mapping, from lack of a readable mapping.
      
      2.  If we find the destination page is non uptodate, unlock it (this could be
          made slightly more optimal), then allocate a temporary page to copy the
          source data into. Relock the destination page and continue with the copy.
          However, instead of a usercopy (which might take a fault), copy the data
          from the pinned temporary page via the kernel address space.
      
      (also, rename maxlen to seglen, because it was confusing)
      
      This increases the CPU/memory copy cost by almost 50% on the affected
      workloads. That will be solved by introducing a new set of pagecache write
      aops in a subsequent patch.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08291429
    • F
      filemap: convert some unsigned long to pgoff_t · 57f6b96c
      Fengguang Wu 提交于
      Convert some 'unsigned long' to pgoff_t.
      Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57f6b96c
  10. 09 5月, 2007 1 次提交
  11. 08 5月, 2007 1 次提交
  12. 10 2月, 2007 1 次提交
  13. 29 10月, 2006 1 次提交
  14. 26 9月, 2006 1 次提交
    • N
      [PATCH] mm: non syncing lock_page() · db37648c
      Nick Piggin 提交于
      lock_page needs the caller to have a reference on the page->mapping inode
      due to sync_page, ergo set_page_dirty_lock is obviously buggy according to
      its comments.
      
      Solve it by introducing a new lock_page_nosync which does not do a sync_page.
      
      akpm: unpleasant solution to an unpleasant problem.  If it goes wrong it could
      cause great slowdowns while the lock_page() caller waits for kblockd to
      perform the unplug.  And if a filesystem has special sync_page() requirements
      (none presently do), permanent hangs are possible.
      
      otoh, set_page_dirty_lock() is usually (always?) called against userspace
      pages.  They are always up-to-date, so there shouldn't be any pending read I/O
      against these pages.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      db37648c
  15. 01 7月, 2006 1 次提交
  16. 23 6月, 2006 1 次提交
  17. 27 4月, 2006 1 次提交
  18. 01 4月, 2006 1 次提交
  19. 24 3月, 2006 1 次提交
    • P
      [PATCH] cpuset memory spread page cache implementation and hooks · 44110fe3
      Paul Jackson 提交于
      Change the page cache allocation calls to support cpuset memory spreading.
      
      See the previous patch, cpuset_mem_spread, for an explanation of cpuset memory
      spreading.
      
      On systems without cpusets configured in the kernel, this is no change.
      
      On systems with cpusets configured in the kernel, but the "memory_spread"
      cpuset option not enabled for the current tasks cpuset, this adds a call to a
      cpuset routine and failed bit test of the processor state flag PF_SPREAD_PAGE.
      
      On tasks in cpusets with "memory_spread" enabled, this adds a call to a cpuset
      routine that computes which of the tasks mems_allowed nodes should be
      preferred for this allocation.
      
      If memory spreading applies to a particular allocation, then any other NUMA
      mempolicy does not apply.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      44110fe3
  20. 14 11月, 2005 1 次提交
  21. 28 10月, 2005 2 次提交
  22. 09 10月, 2005 1 次提交
  23. 22 6月, 2005 1 次提交
    • M
      [PATCH] VM: add __GFP_NORECLAIM · 0c35bbad
      Martin Hicks 提交于
      When using the early zone reclaim, it was noticed that allocating new pages
      that should be spread across the whole system caused eviction of local pages.
      
      This adds a new GFP flag to prevent early reclaim from happening during
      certain allocation attempts.  The example that is implemented here is for page
      cache pages.  We want page cache pages to be spread across the whole system,
      and we don't want page cache pages to evict other pages to get local memory.
      Signed-off-by: NMartin Hicks <mort@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      0c35bbad
  24. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4