1. 09 2月, 2008 1 次提交
  2. 08 2月, 2008 1 次提交
    • H
      memcgroup: fix hang with shmem/tmpfs · 82369553
      Hugh Dickins 提交于
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82369553
  3. 06 2月, 2008 16 次提交
    • D
      VFS/Security: Rework inode_getsecurity and callers to return resulting buffer · 42492594
      David P. Quigley 提交于
      This patch modifies the interface to inode_getsecurity to have the function
      return a buffer containing the security blob and its length via parameters
      instead of relying on the calling function to give it an appropriately sized
      buffer.
      
      Security blobs obtained with this function should be freed using the
      release_secctx LSM hook.  This alleviates the problem of the caller having to
      guess a length and preallocate a buffer for this function allowing it to be
      used elsewhere for Labeled NFS.
      
      The patch also removed the unused err parameter.  The conversion is similar to
      the one performed by Al Viro for the security_getprocattr hook.
      Signed-off-by: NDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42492594
    • H
      tmpfs: fix shmem_swaplist races · 1b1b32f2
      Hugh Dickins 提交于
      Intensive swapoff testing shows shmem_unuse spinning on an entry in
      shmem_swaplist pointing to itself: how does that come about?  Days pass...
      
      First guess is this: shmem_delete_inode tests list_empty without taking the
      global mutex (so the swapping case doesn't slow down the common case); but
      there's an instant in shmem_unuse_inode's list_move_tail when the list entry
      may appear empty (a rare case, because it's actually moving the head not the
      the list member).  So there's a danger of leaving the inode on the swaplist
      when it's freed, then reinitialized to point to itself when reused.  Fix that
      by skipping the list_move_tail when it's a no-op, which happens to plug this.
      
      But this same spinning then surfaces on another machine.  Ah, I'd never
      suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
      still hold page lock, which would hold off inode deletion if the page were in
      pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
      doesn't wait on locked pages).  Hmm: we could put the the inode on swaplist
      earlier, but then shmem_unuse_inode could never prune unswapped inodes.
      
      Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
      though I am a little uneasy about the iput which has to follow - it works, and
      I see nothing wrong with it, but it is surprising that shmem inode deletion
      may now occur below shmem_writepage.  Revisit this fix later?
      
      And while we're looking at these races: the way shmem_unuse tests swapped
      without holding info->lock looks unsafe, if we've more than one swap area: a
      racing shmem_writepage on another page of the same inode could be putting it
      in swapcache, just as we're deciding to remove the inode from swaplist -
      there's a danger of going on swap without being listed, so a later swapoff
      would hang, being unable to locate the entry.  Move that test and removal down
      into shmem_unuse_inode, once info->lock is held.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b1b32f2
    • H
      tmpfs: radix_tree_preloading · b409f9fc
      Hugh Dickins 提交于
      Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
      or swap cache, without any radix tree preload: so tending to deplete emergency
      reserves of memory.
      
      GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
      being called under memory pressure, so must not wait for more memory to become
      available.  But shmem_unuse_inode now has a window in which it can and should
      preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
      add_to_page_cache.
      
      shmem_getpage is not so straightforward: its filepage/swappage integrity
      relies upon exchanging between caches under spinlock, and it would need a lot
      of restructuring to place the preloads correctly.  Instead, follow its pattern
      of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
      add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
      radix_tree_preload, followed immediately by radix_tree_preload_end - that
      won't guarantee success in the next add_to_page_cache, but doesn't need to.
      
      And we can then remove that bothersome congestion_wait: when needed, it'll
      automatically get done in the course of the radix_tree_preload.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Looks-good-to: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b409f9fc
    • H
      tmpfs: open a window in shmem_unuse_inode · 2e0e26c7
      Hugh Dickins 提交于
      There are a couple of reasons (patches follow) why it would be good to open a
      window for sleep in shmem_unuse_inode, between its search for a matching swap
      entry, and its handling of the entry found.
      
      shmem_unuse_inode must then use igrab to hold the inode against deletion in
      that window, and its corresponding iput might result in deletion: so it had
      better unlock_page before the iput, and might as well release the page too.
      
      Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
      leave the loop.  So this unwinding moves from try_to_unuse and shmem_unuse
      into shmem_unuse_inode, in the case when it finds a match.
      
      Let try_to_unuse break on error in the shmem_unuse case, as it does in the
      unuse_mm case: though at this point in the series, no error to break on.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e0e26c7
    • H
      tmpfs: make shmem_unuse more preemptible · cb5f7b9a
      Hugh Dickins 提交于
      shmem_unuse is at present an unbroken search through every swap vector page of
      every tmpfs file which might be swapped, all under shmem_swaplist_lock.  This
      dates from long ago, when the caller held mmlist_lock over it all too: long
      gone, but there's never been much pressure for preemptible swapoff.
      
      Make it a little more preemptible, replacing shmem_swaplist_lock by
      shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
      cond_resched_lock (on info->lock) at one convenient point in the
      shmem_unuse_inode loop, where it has no outstanding kmap_atomic.
      
      If we're serious about preemptible swapoff, there's much further to go e.g.
      I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
      case dictate preemptiblility for other configs.  But as in the earlier patch
      to make swapoff scan ptes preemptibly, my hidden agenda is really towards
      making memcgroups work, hardly about preemptibility at all.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb5f7b9a
    • H
      tmpfs: allocate on read when stacked · a0ee5ec5
      Hugh Dickins 提交于
      tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
      size=0).  But if a stacked filesystem such as unionfs gets pages from a sparse
      tmpfs file by reading holes, and then writes to them, it can easily exceed any
      such limit at present.
      
      So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
      reading for the kernel (a KERNEL_DS check, ugh, sorry about that).  Indeed,
      pessimistically mark such pages as dirty, so they cannot get reclaimed and
      unaccounted by mistake.  The venerable shmem_recalc_inode code (originally to
      account for the reclaim of clean pages) suffices to get the accounting right
      when swappages are dropped in favour of more uptodate filepages.
      
      This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
      caused by unionfs writing to a very sparse tmpfs file: to minimize memory
      allocation in swapout, tmpfs requires the swap vector be allocated upfront,
      which wasn't always happening in this stacked case.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0ee5ec5
    • H
      tmpfs: allow filepage alongside swappage · d9fe526a
      Hugh Dickins 提交于
      tmpfs has long allowed for a fresh filepage to be created in pagecache, just
      before shmem_getpage gets the chance to match it up with the swappage which
      already belongs to that offset.  But unionfs_writepage now does a
      find_or_create_page, divorced from shmem_getpage, which leaves conflicting
      filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.
      
      Therefore shmem_writepage (where a page is swizzled from file to swap) must
      now be on the lookout for existing swap, ready to free it in favour of the
      more uptodate filepage, instead of BUGging on that clash.  And when the
      add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
      filepage, otherwise swapoff would hang.  Whereas when add_to_page_cache fails
      in shmem_getpage, it should retry in the same way it already does.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9fe526a
    • H
      tmpfs: move swap swizzling into shmem · 73b1262f
      Hugh Dickins 提交于
      move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
      between tmpfs page cache and swap cache, to avoid page copying) are only used
      by shmem.c; and our subsequent fix for unionfs needs different treatments in
      the two instances of move_from_swap_cache.  Move them from swap_state.c into
      their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
      add_to_swap_cache externally visible.
      
      shmem.c likes to say set_page_dirty where swap_state.c liked to say
      SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
      makes moot (and implies we should lose that "shift page from clean_pages to
      dirty_pages list" comment: it's on neither).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b1262f
    • M
      tmpfs: fix mounts when size is less than the page size · 818db359
      Michael Marineau 提交于
      When tmpfs is mounted with a size less than one page, the number of blocks
      is set to 0 which makes the tmpfs mount unlimited.  This can lead to a
      quick and surprising death if someone typos a tmpfs mount command and
      writes too much.
      
      tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0,
      as Documentation/filesystems/tmpfs.txt says.
      
      Hugh: do this by rounding size up instead of down in all cases: which
      slightly expands other odd-sized tmpfs mounts, but in a consistent way.
      Signed-off-by: NMichael Marineau <mike@marineau.org>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      818db359
    • P
      shmem: factor out sbi->free_inodes manipulations · 5b04c689
      Pavel Emelyanov 提交于
      The shmem_sb_info structure has a number of free_inodes. This
      value is altered in appropriate places under spinlock and with
      the sbi->max_inodes != 0 check.
      
      Consolidate these manipulations into two helpers.
      
      This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
      
      [akpm@linux-foundation.org: fix error return values]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b04c689
    • H
      shmem_file_write is redundant · 5402b976
      Hugh Dickins 提交于
      With the old aops, writing to a tmpfs file had to use its own special method:
      the generic method would pass in a fresh page to prepare_write when the right
      page was there in swapcache - which was inefficient to handle, even once we'd
      concocted the code to handle it.
      
      With the new aops, the generic method uses shmem_write_end, which lets
      shmem_getpage find the right page: so now abandon shmem_file_write in favour
      of the generic method.  Yes, that does do several things that tmpfs hasn't
      really needed (notably balance_dirty_pages_ratelimited, which ramfs also
      calls); but more use of common code is preferable.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5402b976
    • H
      shmem_getpage return page locked · d3602444
      Hugh Dickins 提交于
      In the new aops, write_begin is supposed to return the page locked: though
      I've seen no ill effects, that's been overlooked in the case of
      shmem_write_begin, and should be fixed.  Then shmem_write_end must unlock the
      page: do so _after_ updating i_size, as we found to be important in other
      filesystems (though since shmem pages don't go the usual writeback route, they
      never suffered from that corruption).
      
      For shmem_write_begin to return the page locked, we need shmem_getpage to
      return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's
      simplify the interface and return it locked even when SGP_READ.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3602444
    • H
      shmem: SGP_QUICK and SGP_FAULT redundant · 27d54b39
      Hugh Dickins 提交于
      Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no
      users now.  Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and
      shmem_getpage is about to return with page always locked).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d54b39
    • H
      swapin needs gfp_mask for loop on tmpfs · 02098fea
      Hugh Dickins 提交于
      Building in a filesystem on a loop device on a tmpfs file can hang when
      swapping, the loop thread caught in that infamous throttle_vm_writeout.
      
      In theory this is a long standing problem, which I've either never seen in
      practice, or long ago suppressed the recollection, after discounting my load
      and my tmpfs size as unrealistically high.  But now, with the new aops, it has
      become easy to hang on one machine.
      
      Loop used to grab_cache_page before the old prepare_write to tmpfs, which
      seems to have been enough to free up some memory for any swapin needed; but
      the new write_begin lets tmpfs find or allocate the page (much nicer, since
      grab_cache_page missed tmpfs pages in swapcache).
      
      When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
      has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
      break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in,
      read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
      mapping_gfp_mask - hence the hang.
      
      So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
      swapin_readahead to read_swap_cache_async to add_to_swap_cache.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02098fea
    • H
      swapin_readahead: move and rearrange args · 46017e95
      Hugh Dickins 提交于
      swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
      beside its kindred read_swap_cache_async.  Why were its args in a different
      order?  rearrange them.  And since it was always followed by a
      read_swap_cache_async of the target page, fold that in and return struct
      page*.  Then CONFIG_SWAP=n no longer needs valid_swaphandles and
      read_swap_cache_async stubs.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46017e95
    • H
      swapin_readahead: excise NUMA bogosity · c4cc6d07
      Hugh Dickins 提交于
      For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
      code, advancing addr, and stepping on to the next vma at the boundary, to line
      up the mempolicy for each page allocation.
      
      It _might_ be a good idea to allocate swap more according to vma layout; but
      the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
      allocated as needed for pages as they sink to the bottom of the inactive LRUs.
       Sometimes that may match vma layout, but not so often that it's worth going
      to these misleading vma->vm_next lengths: rip all that out.
      
      Originally I intended to retain the incrementation of addr, but correct its
      initial value: valid_swaphandles generally supplies an offset below the target
      addr (this is readaround rather than readahead), but addr has not been
      adjusted accordingly, so in the interleave case it has usually been allocating
      the target page from the "wrong" node (though that may not matter very much).
      
      But look at the equivalent shmem_swapin code: either by oversight or by
      design, though it has all the apparatus for choosing a new mempolicy per page,
      it uses the same idx throughout, choosing the same mempolicy and interleave
      node for each page of the cluster.
      
      Which is actually a much better strategy: each node has its own LRUs and its
      own kswapd, so if you're betting on any particular relationship between swap
      and node, the best bet is that nearby swap entries belong to pages from the
      same node - even when the mempolicy of the target page is to interleave.  And
      examining a map of nodes corresponding to swap entries on a numa=fake system
      bears this out.  (We could later tweak swap allocation to make it even more
      likely, but this patch is merely about removing cruft.)
      
      So, neither adjust nor increment addr in swapin_readahead, and then
      shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
      once per cluster, and so few fields of pvma are used, let's skip the memset -
      from shmem_alloc_page also.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4cc6d07
  4. 29 11月, 2007 1 次提交
  5. 30 10月, 2007 1 次提交
    • H
      fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE · 487e9bf2
      Hugh Dickins 提交于
      It's possible to provoke unionfs (not yet in mainline, though in mm and
      some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).  I expect
      it's possible to provoke the 2.6.23 ecryptfs in the same way (but the
      2.6.24 ecryptfs no longer calls lower level's ->writepage).
      
      This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could
      leak from tmpfs via write_cache_pages and unionfs to userspace.  There's
      already a fix (e4230030 - writeback: don't
      propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so
      far as it goes; but insufficient because it doesn't address the underlying
      issue, that shmem_writepage expects to be called only by vmscan (relying on
      backing_dev_info capabilities to prevent the normal writeback path from
      ever approaching it).
      
      That's an increasingly fragile assumption, and ramdisk_writepage (the other
      source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check
      wbc->for_reclaim before returning it.  Make the same check in
      shmem_writepage, thereby sidestepping the page_mapped BUG also.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Erez Zadok <ezk@cs.sunysb.edu>
      Cc: <stable@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      487e9bf2
  6. 22 10月, 2007 2 次提交
  7. 17 10月, 2007 9 次提交
  8. 20 7月, 2007 5 次提交
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt 提交于
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      20c2df83
    • N
      mm: fault feedback #2 · 83c54070
      Nick Piggin 提交于
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NAndi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • N
      mm: fault feedback #1 · d0217ac0
      Nick Piggin 提交于
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
    • N
      mm: merge populate and nopage into fault (fixes nonlinear) · 54cb8821
      Nick Piggin 提交于
      Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
      the virtual address -> file offset differently from linear mappings.
      
      ->populate is a layering violation because the filesystem/pagecache code
      should need to know anything about the virtual memory mapping.  The hitch here
      is that the ->nopage handler didn't pass down enough information (ie.  pgoff).
       But it is more logical to pass pgoff rather than have the ->nopage function
      calculate it itself anyway (because that's a similar layering violation).
      
      Having the populate handler install the pte itself is likewise a nasty thing
      to be doing.
      
      This patch introduces a new fault handler that replaces ->nopage and
      ->populate and (later) ->nopfn.  Most of the old mechanism is still in place
      so there is a lot of duplication and nice cleanups that can be removed if
      everyone switches over.
      
      The rationale for doing this in the first place is that nonlinear mappings are
      subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
      to duplicate the synchronisation logic rather than just consolidate the two.
      
      After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
      pagecache.  Seems like a fringe functionality anyway.
      
      NOPAGE_REFAULT is removed.  This should be implemented with ->fault, and no
      users have hit mainline yet.
      
      [akpm@linux-foundation.org: cleanup]
      [randy.dunlap@oracle.com: doc. fixes for readahead]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Mark Fasheh <mark.fasheh@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54cb8821
    • N
      mm: fix fault vs invalidate race for linear mappings · d00806b1
      Nick Piggin 提交于
      Fix the race between invalidate_inode_pages and do_no_page.
      
      Andrea Arcangeli identified a subtle race between invalidation of pages from
      pagecache with userspace mappings, and do_no_page.
      
      The issue is that invalidation has to shoot down all mappings to the page,
      before it can be discarded from the pagecache.  Between shooting down ptes to
      a particular page, and actually dropping the struct page from the pagecache,
      do_no_page from any process might fault on that page and establish a new
      mapping to the page just before it gets discarded from the pagecache.
      
      The most common case where such invalidation is used is in file truncation.
      This case was catered for by doing a sort of open-coded seqlock between the
      file's i_size, and its truncate_count.
      
      Truncation will decrease i_size, then increment truncate_count before
      unmapping userspace pages; do_no_page will read truncate_count, then find the
      page if it is within i_size, and then check truncate_count under the page
      table lock and back out and retry if it had subsequently been changed (ptl
      will serialise against unmapping, and ensure a potentially updated
      truncate_count is actually visible).
      
      Complexity and documentation issues aside, the locking protocol fails in the
      case where we would like to invalidate pagecache inside i_size.  do_no_page
      can come in anytime and filemap_nopage is not aware of the invalidation in
      progress (as it is when it is outside i_size).  The end result is that
      dangling (->mapping == NULL) pages that appear to be from a particular file
      may be mapped into userspace with nonsense data.  Valid mappings to the same
      place will see a different page.
      
      Andrea implemented two working fixes, one using a real seqlock, another using
      a page->flags bit.  He also proposed using the page lock in do_no_page, but
      that was initially considered too heavyweight.  However, it is not a global or
      per-file lock, and the page cacheline is modified in do_no_page to increment
      _count and _mapcount anyway, so a further modification should not be a large
      performance hit.  Scalability is not an issue.
      
      This patch implements this latter approach.  ->nopage implementations return
      with the page locked if it is possible for their underlying file to be
      invalidated (in that case, they must set a special vm_flags bit to indicate
      so).  do_no_page only unlocks the page after setting up the mapping
      completely.  invalidation is excluded because it holds the page lock during
      invalidation of each page (and ensures that the page is not mapped while
      holding the lock).
      
      This also allows significant simplifications in do_no_page, because we have
      the page locked in the right place in the pagecache from the start.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d00806b1
  9. 18 7月, 2007 2 次提交
    • C
      knfsd: exportfs: add exportfs.h header · a5694255
      Christoph Hellwig 提交于
      currently the export_operation structure and helpers related to it are in
      fs.h.  fs.h is already far too large and there are very few places needing the
      export bits, so split them off into a separate header.
      
      [akpm@linux-foundation.org: fix cifs build]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NNeil Brown <neilb@suse.de>
      Cc: Steven French <sfrench@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5694255
    • M
      Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Mel Gorman 提交于
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickens.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickens for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      769848c0
  10. 10 7月, 2007 1 次提交
  11. 09 6月, 2007 1 次提交