1. 01 April 2009 (1 commit)
    • shmem: writepage directly to swap · 9fab5619
      Committed by Hugh Dickins
      Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
      swap loads benefit, and a catastrophic interaction between SLUB and some
      flash storage is avoided.
      
      shmem_writepage() has always been peculiar in making no attempt to write:
      it has just transferred a shmem page from file cache to swap cache, then
      let that page make its way around the LRU again before being written and
      freed.
      
      The idea was that people use tmpfs because they want those pages to stay
      in RAM; so although we give it an overflow to swap, we should resist
      writing too soon, giving those pages a second chance before they can be
      reclaimed.
      
      That was always questionable, and I've toyed with this patch for years;
      but never had a clear justification to depart from the original design.
      
      It became more questionable in 2.6.28, when the split LRU patches classed
      shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
      itself gives them more resistance to reclaim than normal file pages.  I
      prepared this patch for 2.6.29, but the merge window arrived before I'd
      completed gathering statistics to justify sending it in.
      
      Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
      habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
      tests five times slower than SLAB or SLQB - other machines slower too, but
      nowhere near so bad.  Simpler "cp -a" swapping tests showed the same.
      
      slub_max_order=0 brings sanity to all, but heavy swapping is too far from
      normal to justify such a tuning.  The crucial factor on that laptop turns
      out to be that I'm using an SD card for swap.  What happens is this:
      
      By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
      fs inodes), so creating tmpfs files under memory pressure brings lumpy
      reclaim into play.  One subpage of the order is chosen from the bottom of
      the LRU as usual, then the other three are picked out from their random
      positions on the LRUs.
      
      In a tmpfs load, many of these pages will be ones which already passed
      through shmem_writepage, so already have swap allocated.  And though their
      offsets on swap were probably allocated sequentially, now that the pages
      are picked off at random, their swap offsets are scattered.
      
      But the flash storage on the SD card is very sensitive to having its
      writes merged: once swap is written at scattered offsets, performance
      falls apart.  Rotating disk seeks increase too, but less disastrously.
      
      So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
      out to swap as soon as their swap has been allocated.
      
      It's surely possible to devise an artificial load which runs faster the
      old way, one whose sizing is such that the tmpfs pages on their second
      pass are the ones that are wanted again, and other pages not.
      
      But I've not yet found such a load: on all machines, under the loads I've
      tried, immediate swap_writepage speeds up shmem swapping: especially when
      using the SLUB allocator (and more effectively than slub_max_order=0), but
      also with the others; and it also reduces the variance between runs.  How
      much faster varies widely: a factor of five is rare, 5% is common.
      
      One load which might have suffered: imagine a swapping shmem load in a
      limited mem_cgroup on a machine with plenty of memory.  Before 2.6.29 the
      swapcache was not charged, and such a load would have run quickest with
      the shmem swapcache never written to swap.  But now swapcache is charged,
      so even this load benefits from shmem_writepage directly to swap.
      
      Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
      it's silly because that will never get called; but refactoring shmem.c
      sensibly according to CONFIG_SWAP will be a separate task.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fab5619
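
      The change described above boils down to flipping shmem's writepage
      policy from "defer to a second LRU pass" to "write as soon as a swap
      slot has been allocated".  A tiny illustrative model of that decision
      (names and types here are mine, not the kernel's):

          #include <stdbool.h>

          /* Illustrative sketch only, not kernel code. */
          enum writepage_action { DEFER_TO_LRU, WRITE_NOW };

          static enum writepage_action
          shmem_writepage_action(bool swap_allocated, bool write_immediately)
          {
                  if (!swap_allocated)
                          return DEFER_TO_LRU;           /* no swap slot yet */
                  /* old: second pass around the LRU; new: hand the page to
                   * swap writeout right away, so sequentially allocated swap
                   * offsets are also written sequentially and can be merged. */
                  return write_immediately ? WRITE_NOW : DEFER_TO_LRU;
          }
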
  2. 26 February 2009 (1 commit)
    • shmem: fix shared anonymous accounting · 0b0a0806
      Committed by Hugh Dickins
      Each time I exit Firefox, /proc/meminfo's Committed_AS goes down almost
      400 kB: OVERCOMMIT_NEVER would be allowing overcommits it should
      prohibit.
      
      Commit fc8744ad "Stop playing silly
      games with the VM_ACCOUNT flag" changed shmem_file_setup() to set the
      shmem file's VM_ACCOUNT flag according to VM_NORESERVE not being set in
      the vma flags; but did so only _after_ the shmem_acct_size(flags, size)
      call which is expected to pre-account a shared anonymous object.
      
      It's all clearer if we switch shmem.c over to use VM_NORESERVE
      throughout in place of !VM_ACCOUNT.
      
      But I very nearly sent in a patch which mistakenly removed the
      accounting from tmpfs files: shmem_get_inode()'s memset was good for not
      setting VM_ACCOUNT, but now it needs to set VM_NORESERVE.
      
      Rather than setting that by default, then perhaps clearing it again in
      shmem_file_setup(), let's pass it as a flag to shmem_get_inode(): that
      allows us to remove the #ifdef CONFIG_SHMEM from shmem_file_setup().
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b0a0806
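
      The fix above hinges on ordering: the reservation decision has to be
      made before shmem_acct_size() pre-accounts the object, and the same
      decision then travels to shmem_get_inode() as a flag.  A hedged,
      simplified sketch of that ordering (flag value and helper names are
      illustrative, not the real shmem.c interfaces):

          #include <stdbool.h>

          #define VM_NORESERVE 0x1UL            /* illustrative value */

          static long committed_as;             /* toy stand-in for Committed_AS */

          static bool shmem_acct_size_sketch(unsigned long flags, long size)
          {
                  if (flags & VM_NORESERVE)
                          return true;          /* nothing to pre-account */
                  committed_as += size;         /* what OVERCOMMIT_NEVER checks */
                  return true;
          }

          static unsigned long shmem_file_setup_sketch(unsigned long vm_flags,
                                                       long size)
          {
                  unsigned long flags = vm_flags & VM_NORESERVE;

                  shmem_acct_size_sketch(flags, size);
                  return flags;     /* handed on to the inode-creation step */
          }
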
  3. 11 February 2009 (1 commit)
    • integrity: shmem zero fix · ed850a52
      Committed by Mimi Zohar
      Based on comments from Mike Frysinger and Randy Dunlap:
      (http://lkml.org/lkml/2009/2/9/262)
      - moved ima.h include before CONFIG_SHMEM test to fix compiler error
        on Blackfin:
      mm/shmem.c: In function 'shmem_zero_setup':
      mm/shmem.c:2670: error: implicit declaration of function 'ima_shm_check'
      
      - added 'struct linux_binprm' in ima.h to fix compiler warning on Blackfin:
      In file included from mm/shmem.c:32:
      include/linux/ima.h:25: warning: 'struct linux_binprm' declared inside
      parameter list
      include/linux/ima.h:25: warning: its scope is only this definition or
      declaration, which is probably not what you want
      
      - moved fs.h include within _LINUX_IMA_H definition
      Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
      Signed-off-by: Mike Frysinger <vapier@gentoo.org>
      Signed-off-by: James Morris <jmorris@namei.org>
      ed850a52
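
      The underlying issue is ordinary C header hygiene: a prototype may use
      an incomplete type only if that type has been forward-declared, and the
      fs.h include has to live inside the header's own guard.  A simplified
      sketch of the resulting ima.h layout (prototypes abbreviated, not the
      full header):

          #ifndef _LINUX_IMA_H
          #define _LINUX_IMA_H

          #include <linux/fs.h>         /* moved inside the guard */

          struct linux_binprm;          /* forward declaration: the prototype
                                           below only needs an incomplete type */

          int ima_bprm_check(struct linux_binprm *bprm);
          int ima_shm_check(struct file *file);

          #endif /* _LINUX_IMA_H */
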
  4. 06 February 2009 (1 commit)
    • Integrity: IMA file free imbalance · 1df9f0a7
      Committed by Mimi Zohar
      The number of calls to ima_path_check()/ima_file_free()
      should be balanced.  An extra call to fput() indicates
      the file could have been accessed without first being
      measured.
      
      Although f_count is incremented/decremented in places other
      than fget/fput, like fget_light/fput_light and get_file, the
      current task must already hold a file refcnt.  The call to
      __fput() is delayed until the refcnt becomes 0, resulting
      in ima_file_free() flagging any changes.
      
      - add hook to increment opencount for IPC shared memory (SYSV),
        shmat files, and /dev/zero
      - moved NULL iint test in opencount_get()
      Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
      1df9f0a7
  5. 01 February 2009 (1 commit)
    • Stop playing silly games with the VM_ACCOUNT flag · fc8744ad
      Committed by Linus Torvalds
      The mmap_region() code would temporarily set the VM_ACCOUNT flag for
      anonymous shared mappings just to inform shmem_zero_setup() that it
      should enable accounting for the resulting shm object.  It would then
      clear the flag after calling ->mmap (for the /dev/zero case) or doing
      shmem_zero_setup() (for the MAP_ANON case).
      
      This not only resulted in vma merge issues, but also made for
      unnecessary confusion.  Use the already-existing VM_NORESERVE flag for
      this instead, and let shmem_{zero|file}_setup() just figure it out from
      that.
      
      This also happens to make it obvious that the new DRI2 GEM layer uses a
      non-reserving backing store for its object allocation - which is quite
      possibly not intentional.  But since I didn't want to change semantics
      in this patch, I left it alone, and just updated the caller to use the
      new flag semantics.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc8744ad
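
      Conceptually, the patch replaces "temporarily set VM_ACCOUNT so the
      callee knows" with "let shmem_{zero|file}_setup() read the caller's
      VM_NORESERVE intent directly".  A minimal sketch of that inversion
      (flag value illustrative):

          #include <stdbool.h>

          #define VM_NORESERVE 0x2UL    /* illustrative value */

          /* New scheme: the shm setup code derives the accounting decision
           * from VM_NORESERVE itself, instead of mmap_region() toggling
           * VM_ACCOUNT around the call and clearing it afterwards. */
          static bool shmem_object_is_accounted(unsigned long vm_flags)
          {
                  return !(vm_flags & VM_NORESERVE);
          }
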
  6. 09 January 2009 (4 commits)
    • memcg: fix shmem's swap accounting · b5a84319
      Committed by KAMEZAWA Hiroyuki
      Now, you can see the following even when swap accounting is enabled.
      
       1. Create Group 01, and 02.
       2. allocate a "file" on tmpfs by a task under 01.
       3. swap out the "file" (by memory pressure)
       4. Read "file" from a task in group 02.
       5. the charge of "file" is moved to group 02.
      
      This is not ideal behavior. This is because SwapCache which was loaded
      by read-ahead is not taken into account.
      
      This is a patch to fix shmem's swapcache behavior.
        - remove mem_cgroup_cache_charge_swapin().
        - Add SwapCache handler routine to mem_cgroup_cache_charge().
          By this, shmem's file cache is charged at add_to_page_cache()
          with GFP_NOWAIT.
        - pass the page of swapcache to shrink_mem_cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5a84319
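
      The key point is that the cache-charge path becomes SwapCache-aware, so
      a page pulled in by read-ahead on behalf of group 01 keeps its charge
      even if group 02 touches it first.  A schematic sketch of that dispatch
      (types and names simplified, not the real memcontrol API):

          #include <stdbool.h>

          struct page_sketch {
                  bool is_swapcache;        /* page sits in the swap cache */
                  bool already_charged;     /* some memcg already owns it  */
          };

          /* Sketch: charging at add_to_page_cache() time, non-blocking, and
           * leaving an existing owner's charge alone for SwapCache pages. */
          static int cache_charge_sketch(struct page_sketch *page)
          {
                  if (page->is_swapcache && page->already_charged)
                          return 0;         /* keep the original group's charge */
                  /* ... charge the current group's memcg, GFP_NOWAIT-style ... */
                  return 0;
          }
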
    • memcg: revert gfp mask fix · 2c26fdd7
      Committed by KAMEZAWA Hiroyuki
      My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the
      gfp_mask of callers of charge to GFP_HIGHUSER_MOVABLE, to show what
      will happen at memory reclaim.
      
      But in recent discussion, it's NACKed because it sounds ugly.
      
      This patch reverts it and adds some cleanup to the gfp_mask of callers
      of charge.  No behavior change, but it needs review before it generates
      hunks deep in the queue.
      
      This patch also adds an explanation of the meaning of the gfp_mask
      passed to charge functions in memcontrol.h.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c26fdd7
    • memcg: handle swap caches · d13d1443
      Committed by KAMEZAWA Hiroyuki
      SwapCache support for memory resource controller (memcg)
      
      Before the mem+swap controller, memcg itself should handle SwapCache in
      a proper way.  This is cut out from that work.

      In the current memcg, SwapCache is just leaked and the user can create
      tons of SwapCache.  This is an accounting leak and should be handled.
      
      SwapCache accounting is done as following.
      
        charge (anon)
      	- charged when it's mapped.
      	  (because of readahead, charge at add_to_swap_cache() is not sane)
        uncharge (anon)
              - uncharged when it's dropped from swapcache and fully unmapped,
                meaning it's not uncharged at unmap.
      	  Note: delete from swap cache at swap-in is done after rmap information
      	        is established.
        charge (shmem)
      	- charged at swap-in. this prevents charge at add_to_page_cache().
      
        uncharge (shmem)
      	- uncharged when it's dropped from swapcache and not on shmem's
      	  radix-tree.
      
        At migration, the check against the 'old page' is modified to handle shmem.
      
      Compared to the old version that was discussed (and caused trouble), we
      have the advantages of
        - the PCG_USED bit.
        - simple migration handling.

      So the situation is much easier than several months ago, maybe.
      
      [hugh@veritas.com: memcg: handle swap caches build fix]
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d13d1443
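
      The charge/uncharge rules listed in this entry reduce to a small
      decision table.  As a reading aid, a self-contained sketch that encodes
      them (simplified types, purely illustrative, not the memcg code):

          #include <stdbool.h>

          struct swapcache_page {
                  bool anon;               /* anonymous vs. shmem-backed      */
                  bool mapped;             /* still mapped by some task       */
                  bool in_swapcache;       /* still present in the swap cache */
                  bool on_shmem_tree;      /* still in shmem's radix tree     */
          };

          /* Anon pages are charged when mapped (not at add_to_swap_cache(),
           * because of read-ahead); shmem pages are charged at swap-in. */
          static bool should_charge_anon(const struct swapcache_page *p)
          {
                  return p->mapped;
          }

          static bool should_uncharge(const struct swapcache_page *p)
          {
                  if (p->anon)
                          return !p->in_swapcache && !p->mapped;
                  /* shmem: gone from both the swap cache and the radix tree */
                  return !p->in_swapcache && !p->on_shmem_tree;
          }
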
    • memcg: fix gfp_mask of callers of charge · bced0520
      Committed by KAMEZAWA Hiroyuki
      Fix misuse of GFP_KERNEL.

      Now, most callers of the mem_cgroup_charge_xxx functions use GFP_KERNEL.
      
      I think that this is from the fact that page_cgroup *was* dynamically
      allocated.
      
      But now, we allocate all page_cgroup at boot.  And
      mem_cgroup_try_to_free_pages() reclaims memory from GFP_HIGHUSER_MOVABLE +
      specified GFP_RECLAIM_MASK.
      
        * This is because we just want to reduce memory usage.
          "Where we should reclaim from ?" is not a problem in memcg.
      
      This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE if possible.
      
      Note: This patch is not for fixing behavior but for showing sane information
            in source code.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bced0520
  7. 07 January 2009 (2 commits)
  8. 14 November 2008 (1 commit)
  9. 31 October 2008 (1 commit)
  10. 20 October 2008 (3 commits)
  11. 18 October 2008 (1 commit)
  12. 13 October 2008 (1 commit)
  13. 05 August 2008 (1 commit)
  14. 29 July 2008 (1 commit)
  15. 27 July 2008 (2 commits)
  16. 26 July 2008 (2 commits)
    • memcg: helper function for reclaim from shmem · c9b0ed51
      Committed by KAMEZAWA Hiroyuki
      A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
      replacing non-standard usage of mem_cgroup_charge/uncharge.

      Now, shmem calls mem_cgroup_charge() just to reclaim some pages from a
      mem_cgroup.  In general, shmem is used by some process group and not as
      a global resource (like file caches).  So it's reasonable to reclaim
      pages from the mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9b0ed51
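
      The idea is a dedicated "shrink this group's usage a little" entry
      point, rather than a dummy charge/uncharge cycle whose only purpose is
      to trigger reclaim.  A toy sketch of such a helper (names and the
      usage/limit model are illustrative, not the actual memcontrol API):

          /* Sketch: reclaim from the group that actually uses the shmem
           * object until a further charge of nr pages would fit. */
          struct memcg_sketch { long usage, limit; };

          static void memcg_shrink_usage_sketch(struct memcg_sketch *g, long nr)
          {
                  while (g->usage + nr > g->limit && g->usage > 0)
                          g->usage--;      /* stands in for reclaiming one page */
          }
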
    • memcg: remove refcnt from page_cgroup · 69029cd5
      Committed by KAMEZAWA Hiroyuki
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely patch
       5/5 ... remove redundant check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which were used only
      for refcnt management.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69029cd5
  17. 25 July 2008 (1 commit)
    • tmpfs: support aio · bcd78e49
      Committed by Hugh Dickins
      We have a request for tmpfs to support the AIO interface: easily done, no
      more than replacing the old shmem_file_read by shmem_file_aio_read,
      cribbed from generic_file_aio_read.  (In 2.6.25 its write side was already
      changed to use generic_file_aio_write.)
      
      Incorporate cleanups from Andrew Morton and Harvey Harrison.
      
      Tests out fine with LTP's ltp-aiodio.sh, given hacks (not included) to
      support O_DIRECT.  tmpfs cannot honestly support O_DIRECT: its
      cache-avoiding-IO nature is at odds with direct IO-avoiding-cache.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Tested-by: Lawrence Greenfield <leg@google.com>
      Cc: Christoph Rohland <hans-christoph.rohland@sap.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcd78e49
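
      With an aio_read entry point in place, the plain synchronous read can
      become a thin wrapper that submits the same request and waits for it,
      which is why the patch is "no more than" adding the aio path.  A rough
      illustration of that wrapper shape (generic C, not the kernel's
      do_sync_read/shmem_file_aio_read code):

          #include <stddef.h>

          typedef long (*aio_read_fn)(void *iocb, char *buf, size_t len, long pos);

          /* Toy synchronous read built on an asynchronous read entry point. */
          static long sync_read_via_aio(aio_read_fn aio_read,
                                        char *buf, size_t len, long pos)
          {
                  void *iocb = NULL;          /* stand-in for a synchronous iocb */
                  long ret = aio_read(iocb, buf, len, pos);
                  /* a real implementation would wait here if the request was
                   * queued instead of completed immediately */
                  return ret;
          }
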
  18. 30 April 2008 (1 commit)
  19. 28 April 2008 (9 commits)
    • mempolicy: use struct mempolicy pointer in shmem_sb_info · 71fe804b
      Committed by Lee Schermerhorn
      This patch replaces the mempolicy mode, mode_flags, and nodemask in the
      shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
      This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
      inode.c and simplifies the interfaces.
      
      mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
      pointer arg, a struct mempolicy pointer on success.  For MPOL_DEFAULT, the
      returned pointer is NULL.  Further, mpol_parse_str() now takes a 'no_context'
      argument that causes the input nodemask to be stored in the w.user_nodemask of
      the created mempolicy for use when the mempolicy is installed in a tmpfs inode
      shared policy tree.  At that time, any cpuset contextualization is applied to
      the original input nodemask.  This preserves the previous behavior where the
      input nodemask was stored in the superblock.  We can think of the returned
      mempolicy as "context free".
      
      Because mpol_parse_str() is now calling mpol_new(), we can remove from
      mpol_to_str() the semantic checks that mpol_new() already performs.
      
      Add 'no_context' parameter to mpol_to_str() to specify that it should format
      the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
      
      Change mpol_shared_policy_init() to take a pointer to a "context free" struct
      mempolicy and to create a new, "contextualized" mempolicy using the mode,
      mode_flags and user_nodemask from the input mempolicy.
      
        Note: we know that the mempolicy passed to mpol_to_str() or
        mpol_shared_policy_init() from a tmpfs superblock is "context free".  This
        is currently the only instance thereof.  However, if we found more uses for
        this concept, and introduced any ambiguity as to whether a mempolicy was
        context free or not, we could add another internal mode flag to identify
        context free mempolicies.  Then, we could remove the 'no_context' argument
        from mpol_to_str().
      
      Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
      if one exists, to pass to mpol_shared_policy_init().  We must add the
      reference under the sb stat_lock to prevent races with replacement of the mpol
      by remount.  This reference is removed in mpol_shared_policy_init().
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: yet another build fix]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71fe804b
    • mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Committed by Lee Schermerhorn
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimiter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Committed by Lee Schermerhorn
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superfluous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy reference counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
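
      The two helpers make the "only shared policies carry a reference" rule
      mechanical.  A compact model of them (flag value, struct layout and
      behavior are simplified, not the real mempolicy code):

          #define MPOL_F_SHARED 0x1                /* illustrative flag value */

          struct mempolicy_sketch {
                  int refcnt;
                  unsigned short flags;
          };

          /* Drop a reference only if the lookup actually took one. */
          static void mpol_cond_put_sketch(struct mempolicy_sketch *pol)
          {
                  if (pol && (pol->flags & MPOL_F_SHARED))
                          pol->refcnt--;           /* stands in for mpol_put() */
          }

          /* Copy a shared policy to the stack and strip MPOL_F_SHARED, so the
           * swap read-ahead path can reuse it without conditional puts. */
          static struct mempolicy_sketch
          mpol_cond_copy_sketch(const struct mempolicy_sketch *pol)
          {
                  struct mempolicy_sketch onstack = *pol;

                  onstack.flags &= ~MPOL_F_SHARED;
                  return onstack;
          }
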
    • mempolicy: rename mpol_free to mpol_put · f0be3d32
      Committed by Lee Schermerhorn
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • mempolicy: fix parsing of tmpfs mpol mount option · a43361cf
      Committed by Lee Schermerhorn
      Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:
      
      Setting a valid flag works OK:
      	#mount -o remount,mpol=bind=static:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
      	...
      
      However, we can't remove them or change them, once we've
      set a valid flag:
      
      	#mount -o remount,mpol=bind:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
      	...
      
      It SAYS it removed it, but that's just a copy of the input
      string.  If we now try to set it to a different flag, we
      get:
      
      	#mount -o remount,mpol=bind=relative:1-2 /dev/shm
      	mount: /dev/shm not mounted already, or bad option
      
      And on the console, we see:
      	tmpfs: Bad value 'bind' for mount option 'mpol'
      	                      ^ lost remainder of string
      
      Furthermore, bogus flags are accepted without error.
      Granted, they are a no-op:
      
      	#mount -o remount,mpol=interleave=foo:0-3 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)
      
      Again, that's just a copy of the input string shown by the mount command.
      
      This patch fixes the behavior by pre-zeroing the flags so that only one of the
      mutually exclusive flags can be set at one time.  It also reports an error
      when an unrecognized flag is specified.
      
      The check for both flags being set is removed because it can't happen with
      this implementation.  If we ever want to support multiple non-exclusive flags,
      this area will need rework and we will need to check that any mutually
      exclusive flags aren't specified.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43361cf
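
      The parsing fix amounts to starting each parse from a clean flags value
      and rejecting anything after the second '=' that is not a recognized
      flag.  A small self-contained sketch of that logic (flag names mirror
      the mount syntax shown above; the real mpol_parse_str() does much more):

          #include <string.h>

          #define F_STATIC   0x1        /* illustrative bit values */
          #define F_RELATIVE 0x2

          /* Parse the optional "=static" / "=relative" suffix of an mpol
           * mode; return -1 on an unrecognized flag instead of silently
           * echoing it back. */
          static int parse_mpol_flags(const char *flagstr, unsigned short *flags)
          {
                  *flags = 0;                   /* pre-zero: stale flags can't leak */
                  if (!flagstr || !*flagstr)
                          return 0;             /* no flag given */
                  if (!strcmp(flagstr, "static"))
                          *flags = F_STATIC;
                  else if (!strcmp(flagstr, "relative"))
                          *flags = F_RELATIVE;
                  else
                          return -1;            /* e.g. "interleave=foo:0-3" */
                  return 0;
          }

      With this shape, remounting with a bare "mpol=bind:1-2" clears any
      previously set flag, and a bogus flag is reported as a bad value rather
      than echoed back by mount.
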
    • mempolicy: add MPOL_F_RELATIVE_NODES flag · 4c50bc01
      Committed by David Rientjes
      Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
      nodemasks passed via set_mempolicy() or mbind() should be considered relative
      to the current task's mems_allowed.
      
      When the mempolicy is created, the passed nodemask is folded and mapped onto
      the current task's mems_allowed.  For example, consider a task using
      set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
      nodemask of 1-3.  If current's mems_allowed is 4-7, the effective nodemask is
      5-7 (the second, third, and fourth node of mems_allowed).
      
      If the same task is attached to a cpuset, the mempolicy nodemask is rebound
      each time the mems are changed.  Some possible rebinds and results are:
      
      	mems			result
      	1-3			1-3
      	1-7			2-4
      	1,5-6			1,5-6
      	1,5-7			5-7
      
      Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
      to the resultant nodemask from the relative remap.
      
      In the MPOL_PREFERRED case, the preferred node is remapped from the currently
      effective nodemask to the relative nodemask.
      
      This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c50bc01
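
      The relative fold treats bit n of the user's mask as "the (n mod
      weight)-th allowed node".  A self-contained sketch over plain bitmasks
      (the kernel works on nodemask_t) reproduces the example above, where
      folding nodes 1-3 onto mems_allowed 4-7 yields 5-7:

          /* Map each set bit n of 'user' onto the (n mod weight)-th set bit
           * of 'allowed'.  Illustrative only. */
          static unsigned long fold_relative(unsigned long user, unsigned long allowed)
          {
                  unsigned int bits = 8 * sizeof(unsigned long);
                  unsigned int weight = (unsigned int)__builtin_popcountl(allowed);
                  unsigned long result = 0;

                  if (!weight)
                          return 0;
                  for (unsigned int n = 0; n < bits; n++) {
                          unsigned int idx, seen = 0;

                          if (!(user & (1UL << n)))
                                  continue;
                          idx = n % weight;
                          for (unsigned int m = 0; m < bits; m++) {
                                  if (!(allowed & (1UL << m)))
                                          continue;
                                  if (seen++ == idx) {
                                          result |= 1UL << m;
                                          break;
                                  }
                          }
                  }
                  return result;
          }

      For instance, fold_relative(0x0e /* nodes 1-3 */, 0xf0 /* nodes 4-7 */)
      returns 0xe0, i.e. nodes 5-7, and re-applying it after each rebind
      reproduces the table above.
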
    • mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      Committed by David Rientjes
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	}
      
      that stores the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side-effect in that it allows the mempolicy
      "intent" to lie dormant and unaffected until it has access to the node(s) that
      it desires.  For example, if you currently ask for an interleaved policy over
      a set of nodes that you do not have access to, the mempolicy is not created
      and the task continues to use the previous policy.  With this change, however,
      it is possible to create the same mempolicy; it only takes effect when access
      to nodes in the nodemask is acquired.
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
              mount -o remount,mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new() since it
      is now obsoleted.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5b087b5
    • mempolicy: support optional mode flags · 028fec41
      Committed by David Rientjes
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	}
      
      The splitting of the 'int' actual passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
      there are additional flags, and storing it in the new 'flags' member of struct
      mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
      the 'policy' member of the struct and all current users of pol->policy remain
      unchanged.
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such applications would need to use optional mode flags when
      calling set_mempolicy() or mbind() for these previously implemented statements
      to stop working.  If an application does start using optional mode flags, it
      will need to mask the optional flags off the policy in switch and conditional
      statements that only test mode.
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      028fec41
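
      Mode and optional flags travel in a single int and are separated with
      two masks on entry to the syscalls; any bit that is neither a known
      flag nor part of a valid mode makes the call fail.  A minimal sketch of
      that split (mask and mode values are illustrative):

          #define MPOL_MODE_FLAGS 0xc000     /* illustrative: top bits carry flags */
          #define MPOL_MAX        4          /* illustrative: modes 0..3 */

          static int split_policy(int user_policy,
                                  unsigned short *mode, unsigned short *flags)
          {
                  *flags = (unsigned short)(user_policy & MPOL_MODE_FLAGS);
                  *mode  = (unsigned short)(user_policy & ~MPOL_MODE_FLAGS);
                  /* a stray, unknown flag bit ends up in *mode and pushes it
                   * past the last valid mode, so the call is rejected */
                  if (*mode >= MPOL_MAX)
                          return -1;
                  return 0;
          }
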
    • mempolicy: convert MPOL constants to enum · a3b51e01
      Committed by David Rientjes
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values are in the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3b51e01
  20. 20 March 2008 (1 commit)
  21. 05 March 2008 (1 commit)
  22. 09 February 2008 (2 commits)
  23. 08 February 2008 (1 commit)
    • memcgroup: fix hang with shmem/tmpfs · 82369553
      Committed by Hugh Dickins
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82369553
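
      The pattern here is: do the possibly-sleeping charge first, outside the
      spinlock; let the spinlocked insertion do nothing more than bump a
      count; and undo the charge if the insertion fails.  A schematic of that
      ordering with toy stand-ins (not the actual shmem_getpage code):

          #include <stdbool.h>

          /* Toy stand-ins so the sketch is self-contained.  The real code
           * charges a mem_cgroup (which may wait and reclaim) and inserts
           * into the page cache under a spinlock. */
          static bool charge_may_sleep(void)  { return true; }
          static void uncharge(void)          { }
          static bool insert_under_lock(void) { return true; }

          static bool add_shmem_page_sketch(void)
          {
                  if (!charge_may_sleep())     /* waitable: can reclaim within the group */
                          return false;
                  if (!insert_under_lock()) {  /* add_to_page_cache-style step */
                          uncharge();          /* balance the books on failure */
                          return false;
                  }
                  return true;
          }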