1. 05 Jan, 2009 1 commit
    • fs: symlink write_begin allocation context fix · 54566b2c
      Nick Piggin committed
      With the write_begin/write_end aops, page_symlink was broken because it
      could no longer pass a GFP_NOFS type mask into the point where the
      allocations happened.  They are done in write_begin, which would always
      assume that the filesystem can be entered from reclaim.  This bug could
      cause filesystem deadlocks.
      
      The funny thing with having a gfp_t mask there is that it doesn't really
      allow the caller to arbitrarily tinker with the context in which it can be
      called.  It couldn't ever be GFP_ATOMIC, for example, because it needs to
      take the page lock.  The only thing any callers care about is __GFP_FS
      anyway, so turn that into a single flag.
      
      Add a new flag for write_begin, AOP_FLAG_NOFS.  Filesystems can now act on
      this flag in their write_begin function.  Change __grab_cache_page to
      accept a nofs argument as well, to honour that flag (while we're there,
      change the name to grab_cache_page_write_begin which is more instructive
      and does away with random leading underscores).
      
      This is really a more flexible way to go in the end anyway -- if a
      filesystem happens to want any extra allocations aside from the pagecache
      ones in its write_begin function, it may now use GFP_KERNEL (rather than
      GFP_NOFS) for common case allocations (eg.  ocfs2_alloc_write_ctxt, for a
      random example).
      
      [kosaki.motohiro@jp.fujitsu.com: fix ubifs]
      [kosaki.motohiro@jp.fujitsu.com: fix fuse]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Cleaned up the calling convention: just pass in the AOP flags
        untouched to the grab_cache_page_write_begin() function.  That
        just simplifies everybody, and may even allow future expansion of the
        logic.   - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
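The flag handling described in this commit can be sketched in userspace C. This is only a rough model, not the kernel code: every `MODEL_*` constant and the function `model_write_begin_gfp` are hypothetical stand-ins invented for illustration, and the bit values are arbitrary. It shows the one idea the message relies on: the page cache allocation starts from the mapping's mask and drops the "may enter the filesystem" bit when the caller passed the NOFS flag.

```c
/* Hypothetical stand-ins for kernel definitions; values are arbitrary. */
typedef unsigned int model_gfp_t;

#define MODEL_GFP_WAIT   0x1u
#define MODEL_GFP_IO     0x2u
#define MODEL_GFP_FS     0x4u   /* allocation may re-enter the filesystem */
#define MODEL_GFP_KERNEL (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

#define MODEL_AOP_FLAG_NOFS 0x4u /* models the new write_begin flag */

/*
 * Derive the mask used for the page cache allocation from the mapping's
 * default mask and the flags handed to write_begin. Only the FS bit is
 * ever cleared; the caller cannot otherwise alter the context.
 */
static model_gfp_t model_write_begin_gfp(model_gfp_t mapping_mask,
                                         unsigned int aop_flags)
{
    model_gfp_t mask = mapping_mask;

    if (aop_flags & MODEL_AOP_FLAG_NOFS)
        mask &= ~MODEL_GFP_FS;
    return mask;
}
```

A caller such as page_symlink would pass the NOFS flag, while an ordinary write passes none and keeps the full mask.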
  2. 20 Oct, 2008 4 commits
  3. 05 Aug, 2008 1 commit
  4. 30 Jul, 2008 1 commit
  5. 29 Jul, 2008 1 commit
    • mmu-notifiers: add mm_take_all_locks() operation · 7906d00c
      Andrea Arcangeli committed
      mm_take_all_locks holds off reclaim from an entire mm_struct.  This allows
      mmu notifiers to register into the mm at any time with the guarantee that
      no mmu operation is in progress on the mm.
      
      This operation locks against the VM for all pte/vma/mm related operations
      that could ever happen on a certain mm.  This includes vmtruncate,
      try_to_unmap, and all page faults.
      
      The caller must take the mmap_sem in write mode before calling
      mm_take_all_locks().  The caller isn't allowed to release the mmap_sem
      until mm_drop_all_locks() returns.
      
      mmap_sem in write mode is required in order to block all operations that
      could modify pagetables and free pages without need of altering the vma
      layout (for example populate_range() with nonlinear vmas).  It's also
      needed in write mode to prevent new anon_vmas from being associated with
      existing vmas.
      
      A single task can't take more than one mm_take_all_locks() in a row or it
      would deadlock.
      
      mm_take_all_locks() and mm_drop_all_locks() are expensive operations that
      may have to take thousands of locks.
      
      mm_take_all_locks() can fail if it's interrupted by signals.
      
      When mmu_notifier_register returns, we must be sure that the driver is
      notified if some task is in the middle of a vmtruncate for the 'mm' where
      the mmu notifier was registered (mmu_notifier_invalidate_range_start/end
      is run around the vmtruncation but mmu_notifier_register can run after
      mmu_notifier_invalidate_range_start and before
      mmu_notifier_invalidate_range_end).  Same problem for rmap paths.  And
      we've to remove page pinning to avoid replicating the tlb_gather logic
      inside KVM (and GRU doesn't work well with page pinning regardless of
      needing tlb_gather), so without mm_take_all_locks when vmtruncate frees
      the page, kvm would have no way to notice that it mapped into sptes a page
      that is going into the freelist without a chance of any further
      mmu_notifier notification.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
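The calling rules stated in this commit message (hold mmap_sem for write first, never nest, do not release mmap_sem until the drop returns) can be modelled with a small state tracker. This is a hypothetical userspace sketch: `struct model_mm` and all `model_*` functions are invented here and only mimic the documented protocol, not the kernel's locking. The real mm_take_all_locks() can also fail when interrupted by a signal, which the model does not cover.

```c
#include <stdbool.h>

/* Hypothetical model of the state the protocol cares about. */
struct model_mm {
    bool mmap_sem_write; /* caller holds mmap_sem in write mode */
    bool all_locks;      /* a take-all-locks section is active */
};

static void model_down_write(struct model_mm *mm)
{
    mm->mmap_sem_write = true;
}

/* Returns false if releasing mmap_sem now would break the protocol. */
static bool model_up_write(struct model_mm *mm)
{
    if (mm->all_locks)
        return false;
    mm->mmap_sem_write = false;
    return true;
}

/* Returns 0 on success, -1 on a protocol violation. */
static int model_take_all_locks(struct model_mm *mm)
{
    if (!mm->mmap_sem_write)
        return -1; /* mmap_sem must be held for write first */
    if (mm->all_locks)
        return -1; /* taking it twice in a row would deadlock */
    mm->all_locks = true;
    return 0;
}

static void model_drop_all_locks(struct model_mm *mm)
{
    mm->all_locks = false;
}
```

The intended order is model_down_write, model_take_all_locks, the protected work, model_drop_all_locks, then model_up_write.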
  6. 27 Jul, 2008 1 commit
    • mm: speculative page references · e286781d
      Nick Piggin committed
      If we can be sure that elevating the page_count on a pagecache page will
      pin it, we can speculatively run this operation, and subsequently check to
      see if we hit the right page rather than relying on holding a lock or
      otherwise pinning a reference to the page.
      
      This can be done if get_page/put_page behaves consistently throughout the
      whole tree (ie.  if we "get" the page after it has been used for something
      else, we must be able to free it with a put_page).
      
      Actually, there is a period where the count behaves differently: when the
      page is free or if it is a constituent page of a compound page.  We need
      an atomic_inc_not_zero operation to ensure we don't try to grab the page
      in either case.
      
      This patch introduces the core locking protocol to the pagecache (ie.
      adds page_cache_get_speculative, and tweaks some update-side code to make
      it work).
      
      Thanks to Hugh for pointing out an improvement to the algorithm setting
      page_count to zero when we have control of all references, in order to
      hold off speculative getters.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
      [hugh@veritas.com: fix add_to_page_cache]
      [akpm@linux-foundation.org: repair a comment]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
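The speculative reference idea in this commit can be sketched with C11 atomics. This is a simplified userspace model under stated assumptions, not the kernel implementation: `struct model_page`, `model_inc_not_zero`, `model_put_page` and `model_try_get_speculative` are hypothetical names, and a single atomic pointer stands in for a pagecache slot. A count of zero models a page that is free or whose count was set to zero by an updater to hold off speculative getters.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model: count == 0 means free or held off by an updater. */
struct model_page {
    atomic_int count;
};

/* Increment *v unless it is zero; returns true if the increment happened. */
static bool model_inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

static void model_put_page(struct model_page *page)
{
    atomic_fetch_sub(&page->count, 1);
}

/*
 * One lockless lookup attempt on a slot. Returns the pinned page, or
 * NULL if the slot is empty or the attempt lost a race (the caller
 * would then retry the lookup).
 */
static struct model_page *
model_try_get_speculative(_Atomic(struct model_page *) *slot)
{
    struct model_page *page = atomic_load(slot);

    if (page == NULL)
        return NULL;
    if (!model_inc_not_zero(&page->count))
        return NULL; /* page was free or held off */
    if (page != atomic_load(slot)) {
        /* Slot changed under us: we pinned the wrong page, drop it. */
        model_put_page(page);
        return NULL;
    }
    return page;
}
```

The re-check of the slot after taking the reference is what makes the lookup speculative: the reference is taken first and validated afterwards, instead of holding a lock across the lookup.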
  7. 25 Jul, 2008 1 commit
  8. 14 Feb, 2008 1 commit
  9. 07 Dec, 2007 1 commit
  10. 17 Oct, 2007 3 commits
    • fs: introduce write_begin, write_end, and perform_write aops · afddba49
      Nick Piggin committed
      These are intended to replace prepare_write and commit_write with more
      flexible alternatives that are also able to avoid the buffered write
      deadlock problems efficiently (which prepare_write is unable to do).
      
      [mark.fasheh@oracle.com: API design contributions, code review and fixes]
      [akpm@linux-foundation.org: various fixes]
      [dmonakhov@sw.ru: new aop block_write_begin fix]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: Dmitriy Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix pagecache write deadlocks · 08291429
      Nick Piggin committed
      Modify the core write() code so that it won't take a pagefault while holding a
      lock on the pagecache page. There are a number of different deadlocks possible
      if we try to do such a thing:
      
      1.  generic_buffered_write
      2.   lock_page
      3.    prepare_write
      4.     unlock_page+vmtruncate
      5.     copy_from_user
      6.      mmap_sem(r)
      7.       handle_mm_fault
      8.        lock_page (filemap_nopage)
      9.    commit_write
      10.  unlock_page
      
      a. sys_munmap / sys_mlock / others
      b.  mmap_sem(w)
      c.   make_pages_present
      d.    get_user_pages
      e.     handle_mm_fault
      f.      lock_page (filemap_nopage)
      
      2,8	- recursive deadlock if page is same
      2,8;2,8	- ABBA deadlock if page is different
      2,6;b,f	- ABBA deadlock if page is same
      
      The solution is as follows:
      1.  If we find the destination page is uptodate, continue as normal, but use
          atomic usercopies which do not take pagefaults and do not zero the uncopied
          tail of the destination. The destination is already uptodate, so we can
          commit_write the full length even if there was a partial copy: it does not
          matter that the tail was not modified, because if it is dirtied and written
          back to disk it will not cause any problems (uptodate *means* that the
          destination page is as new or newer than the copy on disk).
      
      1a. The above requires that fault_in_pages_readable correctly returns access
          information, because atomic usercopies cannot distinguish between
          non-present pages in a readable mapping, from lack of a readable mapping.
      
      2.  If we find the destination page is non uptodate, unlock it (this could be
          made slightly more optimal), then allocate a temporary page to copy the
          source data into. Relock the destination page and continue with the copy.
          However, instead of a usercopy (which might take a fault), copy the data
          from the pinned temporary page via the kernel address space.
      
      (also, rename maxlen to seglen, because it was confusing)
      
      This increases the CPU/memory copy cost by almost 50% on the affected
      workloads. That will be solved by introducing a new set of pagecache write
      aops in a subsequent patch.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • filemap: convert some unsigned long to pgoff_t · 57f6b96c
      Fengguang Wu committed
      Convert some 'unsigned long' to pgoff_t.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 May, 2007 1 commit
  12. 08 May, 2007 1 commit
  13. 10 Feb, 2007 1 commit
  14. 29 Oct, 2006 1 commit
  15. 26 Sep, 2006 1 commit
    • [PATCH] mm: non syncing lock_page() · db37648c
      Nick Piggin committed
      lock_page needs the caller to have a reference on the page->mapping inode
      due to sync_page, ergo set_page_dirty_lock is obviously buggy according to
      its comments.
      
      Solve it by introducing a new lock_page_nosync which does not do a sync_page.
      
      akpm: unpleasant solution to an unpleasant problem.  If it goes wrong it could
      cause great slowdowns while the lock_page() caller waits for kblockd to
      perform the unplug.  And if a filesystem has special sync_page() requirements
      (none presently do), permanent hangs are possible.
      
      otoh, set_page_dirty_lock() is usually (always?) called against userspace
      pages.  They are always up-to-date, so there shouldn't be any pending read I/O
      against these pages.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  16. 01 Jul, 2006 1 commit
  17. 23 Jun, 2006 1 commit
  18. 27 Apr, 2006 1 commit
  19. 01 Apr, 2006 1 commit
  20. 24 Mar, 2006 1 commit
    • [PATCH] cpuset memory spread page cache implementation and hooks · 44110fe3
      Paul Jackson committed
      Change the page cache allocation calls to support cpuset memory spreading.
      
      See the previous patch, cpuset_mem_spread, for an explanation of cpuset memory
      spreading.
      
      On systems without cpusets configured in the kernel, this is no change.
      
      On systems with cpusets configured in the kernel, but the "memory_spread"
      cpuset option not enabled for the current task's cpuset, this adds a call to a
      cpuset routine and failed bit test of the processor state flag PF_SPREAD_PAGE.
      
      On tasks in cpusets with "memory_spread" enabled, this adds a call to a cpuset
      routine that computes which of the task's mems_allowed nodes should be
      preferred for this allocation.
      
      If memory spreading applies to a particular allocation, then any other NUMA
      mempolicy does not apply.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  21. 14 Nov, 2005 1 commit
  22. 28 Oct, 2005 2 commits
  23. 09 Oct, 2005 1 commit
  24. 22 Jun, 2005 1 commit
    • [PATCH] VM: add __GFP_NORECLAIM · 0c35bbad
      Martin Hicks committed
      When using the early zone reclaim, it was noticed that allocating new pages
      that should be spread across the whole system caused eviction of local pages.
      
      This adds a new GFP flag to prevent early reclaim from happening during
      certain allocation attempts.  The example that is implemented here is for page
      cache pages.  We want page cache pages to be spread across the whole system,
      and we don't want page cache pages to evict other pages to get local memory.
      Signed-off-by: Martin Hicks <mort@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  25. 17 Apr, 2005 1 commit
    • Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds committed
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!