1. 27 7月, 2008 1 次提交
    • N
      mm: speculative page references · e286781d
      Nick Piggin 提交于
      If we can be sure that elevating the page_count on a pagecache page will
      pin it, we can speculatively run this operation, and subsequently check to
      see if we hit the right page rather than relying on holding a lock or
      otherwise pinning a reference to the page.
      
      This can be done if get_page/put_page behaves consistently throughout the
      whole tree (ie.  if we "get" the page after it has been used for something
      else, we must be able to free it with a put_page).
      
      Actually, there is a period where the count behaves differently: when the
      page is free or if it is a constituent page of a compound page.  We need
      an atomic_inc_not_zero operation to ensure we don't try to grab the page
      in either case.
      
      This patch introduces the core locking protocol to the pagecache (ie.
      adds page_cache_get_speculative, and tweaks some update-side code to make
      it work).
      
      Thanks to Hugh for pointing out an improvement to the algorithm setting
      page_count to zero when we have control of all references, in order to
      hold off speculative getters.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix migration_entry_wait()]
      [hugh@veritas.com: fix add_to_page_cache]
      [akpm@linux-foundation.org: repair a comment]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Jeff Garzik <jeff@garzik.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Reviewed-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e286781d
  2. 26 7月, 2008 2 次提交
    • K
      memcg: helper function for relcaim from shmem. · c9b0ed51
      KAMEZAWA Hiroyuki 提交于
      A new call, mem_cgroup_shrink_usage() is added for shmem handling and
      relacing non-standard usage of mem_cgroup_charge/uncharge.
      
      Now, shmem calls mem_cgroup_charge() just for reclaim some pages from
      mem_cgroup.  In general, shmem is used by some process group and not for
      global resource (like file caches).  So, it's reasonable to reclaim pages
      from mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9b0ed51
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki 提交于
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely ppatch
       5/5 ... remove redundunt check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which was used only for
      refcnt mangement.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69029cd5
  3. 25 7月, 2008 1 次提交
    • H
      tmpfs: support aio · bcd78e49
      Hugh Dickins 提交于
      We have a request for tmpfs to support the AIO interface: easily done, no
      more than replacing the old shmem_file_read by shmem_file_aio_read,
      cribbed from generic_file_aio_read.  (In 2.6.25 its write side was already
      changed to use generic_file_aio_write.)
      
      Incorporate cleanups from Andrew Morton and Harvey Harrison.
      
      Tests out fine with LTP's ltp-aiodio.sh, given hacks (not included) to
      support O_DIRECT.  tmpfs cannot honestly support O_DIRECT: its
      cache-avoiding-IO nature is at odds with direct IO-avoiding-cache.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Tested-by: NLawrence Greenfield <leg@google.com>
      Cc: Christoph Rohland <hans-christoph.rohland@sap.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bcd78e49
  4. 30 4月, 2008 1 次提交
  5. 28 4月, 2008 9 次提交
    • L
      mempolicy: use struct mempolicy pointer in shmem_sb_info · 71fe804b
      Lee Schermerhorn 提交于
      This patch replaces the mempolicy mode, mode_flags, and nodemask in the
      shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
      This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
      inode.c and simplifies the interfaces.
      
      mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
      pointer arg, a struct mempolicy pointer on success.  For MPOL_DEFAULT, the
      returned pointer is NULL.  Further, mpol_parse_str() now takes a 'no_context'
      argument that causes the input nodemask to be stored in the w.user_nodemask of
      the created mempolicy for use when the mempolicy is installed in a tmpfs inode
      shared policy tree.  At that time, any cpuset contextualization is applied to
      the original input nodemask.  This preserves the previous behavior where the
      input nodemask was stored in the superblock.  We can think of the returned
      mempolicy as "context free".
      
      Because mpol_parse_str() is now calling mpol_new(), we can remove from
      mpol_to_str() the semantic checks that mpol_new() already performs.
      
      Add 'no_context' parameter to mpol_to_str() to specify that it should format
      the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
      
      Change mpol_shared_policy_init() to take a pointer to a "context free" struct
      mempolicy and to create a new, "contextualized" mempolicy using the mode,
      mode_flags and user_nodemask from the input mempolicy.
      
        Note: we know that the mempolicy passed to mpol_to_str() or
        mpol_shared_policy_init() from a tmpfs superblock is "context free".  This
        is currently the only instance thereof.  However, if we found more uses for
        this concept, and introduced any ambiguity as to whether a mempolicy was
        context free or not, we could add another internal mode flag to identify
        context free mempolicies.  Then, we could remove the 'no_context' argument
        from mpol_to_str().
      
      Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
      if one exists, to pass to mpol_shared_policy_init().  We must add the
      reference under the sb stat_lock to prevent races with replacement of the mpol
      by remount.  This reference is removed in mpol_shared_policy_init().
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: yet another build fix]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71fe804b
    • L
      mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Lee Schermerhorn 提交于
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimeter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • L
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn 提交于
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • L
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn 提交于
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • L
      mempolicy: fix parsing of tmpfs mpol mount option · a43361cf
      Lee Schermerhorn 提交于
      Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:
      
      Setting a valid flag works OK:
      	#mount -o remount,mpol=bind=static:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
      	...
      
      However, we can't remove them or change them, once we've
      set a valid flag:
      
      	#mount -o remount,mpol=bind:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
      	...
      
      It SAYS it removed it, but that's just a copy of the input
      string.  If we now try to set it to a different flag, we
      get:
      
      	#mount -o remount,mpol=bind=relative:1-2 /dev/shm
      	mount: /dev/shm not mounted already, or bad option
      
      And on the console, we see:
      	tmpfs: Bad value 'bind' for mount option 'mpol'
      	                      ^ lost remainder of string
      
      Furthermore, bogus flags are accepted with out error.
      Granted, they are a no-op:
      
      	#mount -o remount,mpol=interleave=foo:0-3 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)
      
      Again, that's just a copy of the input string shown by the mount command.
      
      This patch fixes the behavior by pre-zeroing the flags so that only one of the
      mutually exclusive flags can be set at one time.  It also reports an error
      when an unrecognized flag is specified.
      
      The check for both flags being set is removed because it can't happen with
      this implementation.  If we ever want to support multiple non-exclusive flags,
      this area will need rework and we will need to check that any mutually
      exclusive flags aren't specified.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43361cf
    • D
      mempolicy: add MPOL_F_RELATIVE_NODES flag · 4c50bc01
      David Rientjes 提交于
      Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
      nodemasks passed via set_mempolicy() or mbind() should be considered relative
      to the current task's mems_allowed.
      
      When the mempolicy is created, the passed nodemask is folded and mapped onto
      the current task's mems_allowed.  For example, consider a task using
      set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
      nodemask of 1-3.  If current's mems_allowed is 4-7, the effected nodemask is
      5-7 (the second, third, and fourth node of mems_allowed).
      
      If the same task is attached to a cpuset, the mempolicy nodemask is rebound
      each time the mems are changed.  Some possible rebinds and results are:
      
      	mems			result
      	1-3			1-3
      	1-7			2-4
      	1,5-6			1,5-6
      	1,5-7			5-7
      
      Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
      to the resultant nodemask from the relative remap.
      
      In the MPOL_PREFERRED case, the preferred node is remapped from the currently
      effected nodemask to the relative nodemask.
      
      This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c50bc01
    • D
      mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      David Rientjes 提交于
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	}
      
      that stores the the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side-effect in that it allows the mempolicy
      "intent" to lie dormant and uneffected until it has access to the node(s) that
      it desires.  For example, if you currently ask for an interleaved policy over
      a set of nodes that you do not have access to, the mempolicy is not created
      and the task continues to use the previous policy.  With this change, however,
      it is possible to create the same mempolicy; it is only effected when access
      to nodes in the nodemask is acquired.
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
      	mount -o remount mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new() since it
      is now obsoleted.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5b087b5
    • D
      mempolicy: support optional mode flags · 028fec41
      David Rientjes 提交于
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	}
      
      The splitting of the 'int' actual passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
      there are additional flags, and storing it in the new 'flags' member of struct
      mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
      the 'policy' member of the struct and all current users of pol->policy remain
      unchanged.
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such applications would need to use optional mode flags when
      calling set_mempolicy() or mbind() for these previously implemented statements
      to stop working.  If an application does start using optional mode flags, it
      will need to mask the optional flags off the policy in switch and conditional
      statements that only test mode.
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      028fec41
    • D
      mempolicy: convert MPOL constants to enum · a3b51e01
      David Rientjes 提交于
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values is the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3b51e01
  6. 20 3月, 2008 1 次提交
  7. 05 3月, 2008 1 次提交
  8. 09 2月, 2008 2 次提交
  9. 08 2月, 2008 1 次提交
    • H
      memcgroup: fix hang with shmem/tmpfs · 82369553
      Hugh Dickins 提交于
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82369553
  10. 06 2月, 2008 16 次提交
    • D
      VFS/Security: Rework inode_getsecurity and callers to return resulting buffer · 42492594
      David P. Quigley 提交于
      This patch modifies the interface to inode_getsecurity to have the function
      return a buffer containing the security blob and its length via parameters
      instead of relying on the calling function to give it an appropriately sized
      buffer.
      
      Security blobs obtained with this function should be freed using the
      release_secctx LSM hook.  This alleviates the problem of the caller having to
      guess a length and preallocate a buffer for this function allowing it to be
      used elsewhere for Labeled NFS.
      
      The patch also removed the unused err parameter.  The conversion is similar to
      the one performed by Al Viro for the security_getprocattr hook.
      Signed-off-by: NDavid P. Quigley <dpquigl@tycho.nsa.gov>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42492594
    • H
      tmpfs: fix shmem_swaplist races · 1b1b32f2
      Hugh Dickins 提交于
      Intensive swapoff testing shows shmem_unuse spinning on an entry in
      shmem_swaplist pointing to itself: how does that come about?  Days pass...
      
      First guess is this: shmem_delete_inode tests list_empty without taking the
      global mutex (so the swapping case doesn't slow down the common case); but
      there's an instant in shmem_unuse_inode's list_move_tail when the list entry
      may appear empty (a rare case, because it's actually moving the head not the
      the list member).  So there's a danger of leaving the inode on the swaplist
      when it's freed, then reinitialized to point to itself when reused.  Fix that
      by skipping the list_move_tail when it's a no-op, which happens to plug this.
      
      But this same spinning then surfaces on another machine.  Ah, I'd never
      suspected it, but shmem_writepage's swaplist manipulation is unsafe: though we
      still hold page lock, which would hold off inode deletion if the page were in
      pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
      doesn't wait on locked pages).  Hmm: we could put the the inode on swaplist
      earlier, but then shmem_unuse_inode could never prune unswapped inodes.
      
      Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
      though I am a little uneasy about the iput which has to follow - it works, and
      I see nothing wrong with it, but it is surprising that shmem inode deletion
      may now occur below shmem_writepage.  Revisit this fix later?
      
      And while we're looking at these races: the way shmem_unuse tests swapped
      without holding info->lock looks unsafe, if we've more than one swap area: a
      racing shmem_writepage on another page of the same inode could be putting it
      in swapcache, just as we're deciding to remove the inode from swaplist -
      there's a danger of going on swap without being listed, so a later swapoff
      would hang, being unable to locate the entry.  Move that test and removal down
      into shmem_unuse_inode, once info->lock is held.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b1b32f2
    • H
      tmpfs: radix_tree_preloading · b409f9fc
      Hugh Dickins 提交于
      Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page cache
      or swap cache, without any radix tree preload: so tending to deplete emergency
      reserves of memory.
      
      GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache: it's
      being called under memory pressure, so must not wait for more memory to become
      available.  But shmem_unuse_inode now has a window in which it can and should
      preload with GFP_KERNEL, and say GFP_NOWAIT instead of GFP_ATOMIC in its
      add_to_page_cache.
      
      shmem_getpage is not so straightforward: its filepage/swappage integrity
      relies upon exchanging between caches under spinlock, and it would need a lot
      of restructuring to place the preloads correctly.  Instead, follow its pattern
      of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
      add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
      radix_tree_preload, followed immediately by radix_tree_preload_end - that
      won't guarantee success in the next add_to_page_cache, but doesn't need to.
      
      And we can then remove that bothersome congestion_wait: when needed, it'll
      automatically get done in the course of the radix_tree_preload.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Looks-good-to: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b409f9fc
    • H
      tmpfs: open a window in shmem_unuse_inode · 2e0e26c7
      Hugh Dickins 提交于
      There are a couple of reasons (patches follow) why it would be good to open a
      window for sleep in shmem_unuse_inode, between its search for a matching swap
      entry, and its handling of the entry found.
      
      shmem_unuse_inode must then use igrab to hold the inode against deletion in
      that window, and its corresponding iput might result in deletion: so it had
      better unlock_page before the iput, and might as well release the page too.
      
      Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
      leave the loop.  So this unwinding moves from try_to_unuse and shmem_unuse
      into shmem_unuse_inode, in the case when it finds a match.
      
      Let try_to_unuse break on error in the shmem_unuse case, as it does in the
      unuse_mm case: though at this point in the series, no error to break on.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2e0e26c7
    • H
      tmpfs: make shmem_unuse more preemptible · cb5f7b9a
      Hugh Dickins 提交于
      shmem_unuse is at present an unbroken search through every swap vector page of
      every tmpfs file which might be swapped, all under shmem_swaplist_lock.  This
      dates from long ago, when the caller held mmlist_lock over it all too: long
      gone, but there's never been much pressure for preemptible swapoff.
      
      Make it a little more preemptible, replacing shmem_swaplist_lock by
      shmem_swaplist_mutex, inserting a cond_resched in the main loop, and a
      cond_resched_lock (on info->lock) at one convenient point in the
      shmem_unuse_inode loop, where it has no outstanding kmap_atomic.
      
      If we're serious about preemptible swapoff, there's much further to go e.g.
      I'm stupid to let the kmap_atomics of the decreasingly significant HIGHMEM
      case dictate preemptiblility for other configs.  But as in the earlier patch
      to make swapoff scan ptes preemptibly, my hidden agenda is really towards
      making memcgroups work, hardly about preemptibility at all.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb5f7b9a
    • H
      tmpfs: allocate on read when stacked · a0ee5ec5
      Hugh Dickins 提交于
      tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0 or
      size=0).  But if a stacked filesystem such as unionfs gets pages from a sparse
      tmpfs file by reading holes, and then writes to them, it can easily exceed any
      such limit at present.
      
      So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
      reading for the kernel (a KERNEL_DS check, ugh, sorry about that).  Indeed,
      pessimistically mark such pages as dirty, so they cannot get reclaimed and
      unaccounted by mistake.  The venerable shmem_recalc_inode code (originally to
      account for the reclaim of clean pages) suffices to get the accounting right
      when swappages are dropped in favour of more uptodate filepages.
      
      This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
      caused by unionfs writing to a very sparse tmpfs file: to minimize memory
      allocation in swapout, tmpfs requires the swap vector be allocated upfront,
      which wasn't always happening in this stacked case.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0ee5ec5
    • H
      tmpfs: allow filepage alongside swappage · d9fe526a
      Hugh Dickins 提交于
      tmpfs has long allowed for a fresh filepage to be created in pagecache, just
      before shmem_getpage gets the chance to match it up with the swappage which
      already belongs to that offset.  But unionfs_writepage now does a
      find_or_create_page, divorced from shmem_getpage, which leaves conflicting
      filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.
      
      Therefore shmem_writepage (where a page is swizzled from file to swap) must
      now be on the lookout for existing swap, ready to free it in favour of the
      more uptodate filepage, instead of BUGging on that clash.  And when the
      add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
      filepage, otherwise swapoff would hang.  Whereas when add_to_page_cache fails
      in shmem_getpage, it should retry in the same way it already does.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9fe526a
    • H
      tmpfs: move swap swizzling into shmem · 73b1262f
      Hugh Dickins 提交于
      move_to_swap_cache and move_from_swap_cache functions (which swizzle a page
      between tmpfs page cache and swap cache, to avoid page copying) are only used
      by shmem.c; and our subsequent fix for unionfs needs different treatments in
      the two instances of move_from_swap_cache.  Move them from swap_state.c into
      their callsites shmem_writepage, shmem_unuse_inode and shmem_getpage, making
      add_to_swap_cache externally visible.
      
      shmem.c likes to say set_page_dirty where swap_state.c liked to say
      SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
      makes moot (and implies we should lose that "shift page from clean_pages to
      dirty_pages list" comment: it's on neither).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73b1262f
    • M
      tmpfs: fix mounts when size is less than the page size · 818db359
      Michael Marineau 提交于
      When tmpfs is mounted with a size less than one page, the number of blocks
      is set to 0 which makes the tmpfs mount unlimited.  This can lead to a
      quick and surprising death if someone typos a tmpfs mount command and
      writes too much.
      
      tmpfs can still be mounted as unlimited if size or nr_blocks is exactly 0,
      as Documentation/filesystems/tmpfs.txt says.
      
      Hugh: do this by rounding size up instead of down in all cases: which
      slightly expands other odd-sized tmpfs mounts, but in a consistent way.
      Signed-off-by: NMichael Marineau <mike@marineau.org>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      818db359
    • P
      shmem: factor out sbi->free_inodes manipulations · 5b04c689
      Pavel Emelyanov 提交于
      The shmem_sb_info structure has a number of free_inodes. This
      value is altered in appropriate places under spinlock and with
      the sbi->max_inodes != 0 check.
      
      Consolidate these manipulations into two helpers.
      
      This is minus 42 bytes of shmem.o and minus 4 :) lines of code.
      
      [akpm@linux-foundation.org: fix error return values]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b04c689
    • H
      shmem_file_write is redundant · 5402b976
      Hugh Dickins 提交于
      With the old aops, writing to a tmpfs file had to use its own special method:
      the generic method would pass in a fresh page to prepare_write when the right
      page was there in swapcache - which was inefficient to handle, even once we'd
      concocted the code to handle it.
      
      With the new aops, the generic method uses shmem_write_end, which lets
      shmem_getpage find the right page: so now abandon shmem_file_write in favour
      of the generic method.  Yes, that does do several things that tmpfs hasn't
      really needed (notably balance_dirty_pages_ratelimited, which ramfs also
      calls); but more use of common code is preferable.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5402b976
    • H
      shmem_getpage return page locked · d3602444
      Hugh Dickins 提交于
      In the new aops, write_begin is supposed to return the page locked: though
      I've seen no ill effects, that's been overlooked in the case of
      shmem_write_begin, and should be fixed.  Then shmem_write_end must unlock the
      page: do so _after_ updating i_size, as we found to be important in other
      filesystems (though since shmem pages don't go the usual writeback route, they
      never suffered from that corruption).
      
      For shmem_write_begin to return the page locked, we need shmem_getpage to
      return the page locked in SGP_WRITE case as well as SGP_CACHE case: let's
      simplify the interface and return it locked even when SGP_READ.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3602444
    • H
      shmem: SGP_QUICK and SGP_FAULT redundant · 27d54b39
      Hugh Dickins 提交于
      Remove SGP_QUICK from the sgp_type enum: it was for shmem_populate and has no
      users now.  Remove SGP_FAULT from the enum: SGP_CACHE does just as well (and
      shmem_getpage is about to return with page always locked).
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d54b39
    • H
      swapin needs gfp_mask for loop on tmpfs · 02098fea
      Hugh Dickins 提交于
      Building in a filesystem on a loop device on a tmpfs file can hang when
      swapping, the loop thread caught in that infamous throttle_vm_writeout.
      
      In theory this is a long standing problem, which I've either never seen in
      practice, or long ago suppressed the recollection, after discounting my load
      and my tmpfs size as unrealistically high.  But now, with the new aops, it has
      become easy to hang on one machine.
      
      Loop used to grab_cache_page before the old prepare_write to tmpfs, which
      seems to have been enough to free up some memory for any swapin needed; but
      the new write_begin lets tmpfs find or allocate the page (much nicer, since
      grab_cache_page missed tmpfs pages in swapcache).
      
      When allocating a fresh page, tmpfs respects loop's mapping_gfp_mask, which
      has __GFP_IO|__GFP_FS stripped off, and throttle_vm_writeout is designed to
      break out when __GFP_IO or GFP_FS is unset; but when tmfps swaps in,
      read_swap_cache_async allocates with GFP_HIGHUSER_MOVABLE regardless of the
      mapping_gfp_mask - hence the hang.
      
      So, pass gfp_mask down the line from shmem_getpage to shmem_swapin to
      swapin_readahead to read_swap_cache_async to add_to_swap_cache.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02098fea
    • H
      swapin_readahead: move and rearrange args · 46017e95
      Hugh Dickins 提交于
      swapin_readahead has never sat well in mm/memory.c: move it to mm/swap_state.c
      beside its kindred read_swap_cache_async.  Why were its args in a different
      order?  rearrange them.  And since it was always followed by a
      read_swap_cache_async of the target page, fold that in and return struct
      page*.  Then CONFIG_SWAP=n no longer needs valid_swaphandles and
      read_swap_cache_async stubs.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46017e95
    • H
      swapin_readahead: excise NUMA bogosity · c4cc6d07
      Hugh Dickins 提交于
      For three years swapin_readahead has been cluttered with fanciful CONFIG_NUMA
      code, advancing addr, and stepping on to the next vma at the boundary, to line
      up the mempolicy for each page allocation.
      
      It _might_ be a good idea to allocate swap more according to vma layout; but
      the fact is, that's not how we do it at all, 2.6 even less than 2.4: swap is
      allocated as needed for pages as they sink to the bottom of the inactive LRUs.
       Sometimes that may match vma layout, but not so often that it's worth going
      to these misleading vma->vm_next lengths: rip all that out.
      
      Originally I intended to retain the incrementation of addr, but correct its
      initial value: valid_swaphandles generally supplies an offset below the target
      addr (this is readaround rather than readahead), but addr has not been
      adjusted accordingly, so in the interleave case it has usually been allocating
      the target page from the "wrong" node (though that may not matter very much).
      
      But look at the equivalent shmem_swapin code: either by oversight or by
      design, though it has all the apparatus for choosing a new mempolicy per page,
      it uses the same idx throughout, choosing the same mempolicy and interleave
      node for each page of the cluster.
      
      Which is actually a much better strategy: each node has its own LRUs and its
      own kswapd, so if you're betting on any particular relationship between swap
      and node, the best bet is that nearby swap entries belong to pages from the
      same node - even when the mempolicy of the target page is to interleave.  And
      examining a map of nodes corresponding to swap entries on a numa=fake system
      bears this out.  (We could later tweak swap allocation to make it even more
      likely, but this patch is merely about removing cruft.)
      
      So, neither adjust nor increment addr in swapin_readahead, and then
      shmem_swapin can use it too; the pseudo-vma to pass policy need only be set up
      once per cluster, and so few fields of pvma are used, let's skip the memset -
      from shmem_alloc_page also.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4cc6d07
  11. 29 11月, 2007 1 次提交
  12. 30 10月, 2007 1 次提交
    • H
      fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE · 487e9bf2
      Hugh Dickins 提交于
      It's possible to provoke unionfs (not yet in mainline, though in mm and
      some distros) to hit shmem_writepage's BUG_ON(page_mapped(page)).  I expect
      it's possible to provoke the 2.6.23 ecryptfs in the same way (but the
      2.6.24 ecryptfs no longer calls lower level's ->writepage).
      
      This came to light with the recent find that AOP_WRITEPAGE_ACTIVATE could
      leak from tmpfs via write_cache_pages and unionfs to userspace.  There's
      already a fix (e4230030 - writeback: don't
      propagate AOP_WRITEPAGE_ACTIVATE) in the tree for that, and it's okay so
      far as it goes; but insufficient because it doesn't address the underlying
      issue, that shmem_writepage expects to be called only by vmscan (relying on
      backing_dev_info capabilities to prevent the normal writeback path from
      ever approaching it).
      
      That's an increasingly fragile assumption, and ramdisk_writepage (the other
      source of AOP_WRITEPAGE_ACTIVATEs) is already careful to check
      wbc->for_reclaim before returning it.  Make the same check in
      shmem_writepage, thereby sidestepping the page_mapped BUG also.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Erez Zadok <ezk@cs.sunysb.edu>
      Cc: <stable@kernel.org>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      487e9bf2
  13. 22 10月, 2007 2 次提交
  14. 17 10月, 2007 1 次提交
    • D
      r/o bind mounts: filesystem helpers for custom 'struct file's · ce8d2cdf
      Dave Hansen 提交于
      Why do we need r/o bind mounts?
      
      This feature allows a read-only view into a read-write filesystem.  In the
      process of doing that, it also provides infrastructure for keeping track of
      the number of writers to any given mount.
      
      This has a number of uses.  It allows chroots to have parts of filesystems
      writable.  It will be useful for containers in the future because users may
      have root inside a container, but should not be allowed to write to
      somefilesystems.  This also replaces patches that vserver has had out of the
      tree for several years.
      
      It allows security enhancement by making sure that parts of your filesystem
      read-only (such as when you don't trust your FTP server), when you don't want
      to have entire new filesystems mounted, or when you want atime selectively
      updated.  I've been using the following script to test that the feature is
      working as desired.  It takes a directory and makes a regular bind and a r/o
      bind mount of it.  It then performs some normal filesystem operations on the
      three directories, including ones that are expected to fail, like creating a
      file on the r/o mount.
      
      This patch:
      
      Some filesystems forego the vfs and may_open() and create their own 'struct
      file's.
      
      This patch creates a couple of helper functions which can be used by these
      filesystems, and will provide a unified place which the r/o bind mount code
      may patch.
      
      Also, rename an existing, static-scope init_file() to a less generic name.
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce8d2cdf