1. 07 11月, 2008 1 次提交
    • C
      mm: move migrate_prep out from under mmap_sem · 0aedadf9
      Christoph Lameter 提交于
      Move the migrate_prep outside the mmap_sem for the following system calls
      
      1. sys_move_pages
      2. sys_migrate_pages
      3. sys_mbind()
      
      It really does not matter when we flush the lru.  The system is free to
      add pages onto the lru even during migration which will make the page
      migration either skip the page (mbind, migrate_pages) or return a busy
      state (move_pages).
      
      Fixes this lockdep warning (and potential deadlock):
      
      Some VM place has
            mmap_sem -> kevent_wq via lru_add_drain_all()
      
      net/core/dev.c::dev_ioctl()  has
           rtnl_lock  ->  mmap_sem        (*) the ioctl has copy_from_user() and it can do page fault.
      
      linkwatch_event has
           kevent_wq -> rtnl_lock
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0aedadf9
  2. 20 10月, 2008 9 次提交
    • K
      memcg: make page->mapping NULL before uncharge · b7abea96
      KAMEZAWA Hiroyuki 提交于
      This patch tries to make page->mapping to be NULL before
      mem_cgroup_uncharge_cache_page() is called.
      
      "page->mapping == NULL" is a good check for "whether the page is still
      radix-tree or not".  This patch also adds BUG_ON() to
      mem_cgroup_uncharge_cache_page();
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7abea96
    • B
      mm: extract do_pages_move() out of sys_move_pages() · 5e9a0f02
      Brice Goglin 提交于
      To prepare the chunking, move the sys_move_pages() code that is used when
      nodes!=NULL into do_pages_move().  And rename do_move_pages() into
      do_move_page_to_node_array().
      Signed-off-by: NBrice Goglin <Brice.Goglin@inria.fr>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e9a0f02
    • B
      mm: don't vmalloc a huge page_to_node array for do_pages_stat() · 2f007e74
      Brice Goglin 提交于
      do_pages_stat() does not need any page_to_node entry for real.  Just pass
      the pointers to the user-space page address array and to the user-space
      status array, and have do_pages_stat() traverse the former and fill the
      latter directly.
      Signed-off-by: NBrice Goglin <Brice.Goglin@inria.fr>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f007e74
    • B
      mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated · e78bbfa8
      Brice Goglin 提交于
      A patchset reworking sys_move_pages().  It removes the possibly large
      vmalloc by using multiple chunks when migrating large buffers.  It also
      dramatically increases the throughput for large buffers since the lookup
      in new_page_node() is now limited to a single chunk, causing the quadratic
      complexity to have a much slower impact.  There is no need to use any
      radix-tree-like structure to improve this lookup.
      
      sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9Gz),
      migrating between nodes #2 and #3:
      
      	length		move_pages (us)		move_pages+patch (us)
      	4kB		126			98
      	40kB		198			168
      	400kB		963			937
      	4MB		12503			11930
      	40MB		246867			11848
      
      Patches #1 and #4 are the important ones:
      1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
      2) don't vmalloc a huge page_to_node array for do_pages_stat()
      3) extract do_pages_move() out of sys_move_pages()
      4) rework do_pages_move() to work on page_sized chunks
      5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default
      
      This patch:
      
      There is no point in returning -ENOENT from sys_move_pages() if all pages
      were already on the right node, while we return 0 if only 1 page was not.
      Most application don't know where their pages are allocated, so it's not
      an error to try to migrate them anyway.
      
      Just return 0 and let the status array in user-space be checked if the
      application needs details.
      
      It will make the upcoming chunked-move_pages() support much easier.
      Signed-off-by: NBrice Goglin <Brice.Goglin@inria.fr>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e78bbfa8
    • N
      mlock: mlocked pages are unevictable · b291f000
      Nick Piggin 提交于
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if vma passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism let's pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
        because current get_user_pages() can't grab PROT_NONE pages theresore it
        cause PROT_NONE pages can't munlock.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b291f000
    • L
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn 提交于
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: NBenjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      894bc310
    • R
      define page_file_cache() function · b2e18538
      Rik van Riel 提交于
      Define page_file_cache() function to answer the question:
      	is page backed by a file?
      
      Originally part of Rik van Riel's split-lru patch.  Extracted to make
      available for other, independent reclaim patches.
      
      Moved inline function to linux/mm_inline.h where it will be needed by
      subsequent "split LRU" and "noreclaim" patches.
      
      Unfortunately this needs to use a page flag, since the PG_swapbacked state
      needs to be preserved all the way to the point where the page is last
      removed from the LRU.  Trying to derive the status from other info in the
      page resulted in wrong VM statistics in earlier split VM patchsets.
      
      The total number of page flags in use on a 32 bit machine after this patch
      is 19.
      
      [akpm@linux-foundation.org: fix up out-of-order merge fallout]
      [hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner[
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NMinChan Kim <minchan.kim@gmail.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b2e18538
    • K
      swap: use an array for the LRU pagevecs · f04e9ebb
      KOSAKI Motohiro 提交于
      Turn the pagevecs into an array just like the LRUs.  This significantly
      cleans up the source code and reduces the size of the kernel by about 13kB
      after all the LRU lists have been created further down in the split VM
      patch series.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f04e9ebb
    • N
      vmscan: move isolate_lru_page() to vmscan.c · 62695a84
      Nick Piggin 提交于
      On large memory systems, the VM can spend way too much time scanning
      through pages that it cannot (or should not) evict from memory.  Not only
      does it use up CPU time, but it also provokes lock contention and can
      leave large systems under memory presure in a catatonic state.
      
      This patch series improves VM scalability by:
      
      1) putting filesystem backed, swap backed and unevictable pages
         onto their own LRUs, so the system only scans the pages that it
         can/should evict from memory
      
      2) switching to two handed clock replacement for the anonymous LRUs,
         so the number of pages that need to be scanned when the system
         starts swapping is bound to a reasonable number
      
      3) keeping unevictable pages off the LRU completely, so the
         VM does not waste CPU time scanning them. ramfs, ramdisk,
         SHM_LOCKED shared memory segments and mlock()ed VMA pages
         are keept on the unevictable list.
      
      This patch:
      
      isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
      
      It is tough, because we don't need that function without memory migration
      so there is a valid argument to have it in migrate.c.  However a
      subsequent patch needs to make use of it in the core mm, so we can happily
      move it to vmscan.c.
      
      Also, make the function a little more generic by not requiring that it
      adds an isolated page to a given list.  Callers can do that.
      
      	Note that we now have '__isolate_lru_page()', that does
      	something quite different, visible outside of vmscan.c
      	for use with memory controller.  Methinks we need to
      	rationalize these names/purposes.	--lts
      
      [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62695a84
  3. 05 8月, 2008 1 次提交
  4. 27 7月, 2008 2 次提交
  5. 26 7月, 2008 2 次提交
    • K
      memcg: remove refcnt from page_cgroup · 69029cd5
      KAMEZAWA Hiroyuki 提交于
      memcg: performance improvements
      
      Patch Description
       1/5 ... remove refcnt fron page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely ppatch
       5/5 ... remove redundunt check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from user's view.
      
      This patch also removes unnecessary calls in rmap.c which was used only for
      refcnt mangement.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69029cd5
    • K
      memcg: better migration handling · e8589cc1
      KAMEZAWA Hiroyuki 提交于
      This patch changes page migration under memory controller to use a
      different algorithm.  (thanks to Christoph for new idea.)
      
      Before:
       - page_cgroup is migrated from an old page to a new page.
      After:
       - a new page is accounted , no reuse of page_cgroup.
      
      Pros:
      
       - We can avoid compliated lock depndencies and races in migration.
      
      Cons:
      
       - new param to mem_cgroup_charge_common().
      
       - mem_cgroup_getref() is added for handling ref_cnt ping-pong.
      
      This version simplifies complicated lock dependency in page migraiton
      under memory resource controller.
      
        new refcnt sequence is following.
      
      a mapped page:
        prepage_migration() ..... +1 to NEW page
        try_to_unmap()      ..... all refs to OLD page is gone.
        move_pages()        ..... +1 to NEW page if page cache.
        remap...            ..... all refs from *map* is added to NEW one.
        end_migration()     ..... -1 to New page.
      
        page's mapcount + (page_is_cache) refs are added to NEW one.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8589cc1
  6. 25 7月, 2008 2 次提交
  7. 05 7月, 2008 1 次提交
  8. 21 6月, 2008 1 次提交
    • L
      Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP · 89f5b7da
      Linus Torvalds 提交于
      KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit
      557ed1fa ("remove ZERO_PAGE") removed
      the ZERO_PAGE from the VM mappings, any users of get_user_pages() will
      generally now populate the VM with real empty pages needlessly.
      
      We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but
      since fault handling no longer uses ZERO_PAGE for new anonymous pages,
      we now need to handle that special case in follow_page() instead.
      
      In particular, the removal of ZERO_PAGE effectively removed the core
      file writing optimization where we would skip writing pages that had not
      been populated at all, and increased memory pressure a lot by allocating
      all those useless newly zeroed pages.
      
      This reinstates the optimization by making the unmapped PTE case the
      same as for a non-existent page table, which already did this correctly.
      
      While at it, this also fixes the XIP case for follow_page(), where the
      caller could not differentiate between the case of a page that simply
      could not be used (because it had no "struct page" associated with it)
      and a page that just wasn't mapped.
      
      We do that by simply returning an error pointer for pages that could not
      be turned into a "struct page *".  The error is arbitrarily picked to be
      EFAULT, since that was what get_user_pages() already used for the
      equivalent IO-mapped page case.
      
      [ Also removed an impossible test for pte_offset_map_lock() failing:
        that's not how that function works ]
      Acked-by: NOleg Nesterov <oleg@tv-sign.ru>
      Acked-by: NNick Piggin <npiggin@suse.de>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89f5b7da
  9. 30 4月, 2008 1 次提交
    • N
      mm: fix warning on memory offline · 3a902c5f
      Nick Piggin 提交于
      KAMEZAWA Hiroyuki found a warning message in the buffer dirtying code that
      is coming from page migration caller.
      
      WARNING: at fs/buffer.c:720 __set_page_dirty+0x330/0x360()
      Call Trace:
       [<a000000100015220>] show_stack+0x80/0xa0
       [<a000000100015270>] dump_stack+0x30/0x60
       [<a000000100089ed0>] warn_on_slowpath+0x90/0xe0
       [<a0000001001f8b10>] __set_page_dirty+0x330/0x360
       [<a0000001001ffb90>] __set_page_dirty_buffers+0xd0/0x280
       [<a00000010012fec0>] set_page_dirty+0xc0/0x260
       [<a000000100195670>] migrate_page_copy+0x5d0/0x5e0
       [<a000000100197840>] buffer_migrate_page+0x2e0/0x3c0
       [<a000000100195eb0>] migrate_pages+0x770/0xe00
      
      What was happening is that migrate_page_copy wants to transfer the PG_dirty
      bit from old page to new page, so what it would do is set_page_dirty(newpage).
      However set_page_dirty() is used to set the entire page dirty, wheras in
      this case, only part of the page was dirty, and it also was not uptodate.
      
      Marking the whole page dirty with set_page_dirty would lead to corruption or
      unresolvable conditions -- a dirty && !uptodate page and dirty && !uptodate
      buffers.
      
      Possibly we could just ClearPageDirty(oldpage); SetPageDirty(newpage);
      however in the interests of keeping the change minimal...
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Tested-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a902c5f
  10. 05 3月, 2008 1 次提交
    • H
      memcg: fix VM_BUG_ON from page migration · 98837c7f
      Hugh Dickins 提交于
      Page migration gave me free_hot_cold_page's VM_BUG_ON page->page_cgroup.
      remove_migration_pte was calling mem_cgroup_charge on the new page whenever it
      found a swap pte, before it had determined it to be a migration entry.  That
      left a surplus reference count on the page_cgroup, so it was still attached
      when the page was later freed.
      
      Move that mem_cgroup_charge down to where we're sure it's a migration entry.
      We were already under i_mmap_lock or anon_vma->lock, so its GFP_KERNEL was
      already inappropriate: change that to GFP_ATOMIC.
      
      It's essential that remove_migration_pte removes all the migration entries,
      other crashes follow if not.  So proceed even when the charge fails: normally
      it cannot, but after a mem_cgroup_force_empty it might - comment in the code.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98837c7f
  11. 08 2月, 2008 3 次提交
    • K
      bugfix for memory cgroup controller: migration under memory controller fix · ae41be37
      KAMEZAWA Hiroyuki 提交于
      While using memory control cgroup, page-migration under it works as following.
      ==
       1. uncharge all refs at try to unmap.
       2. charge regs again remove_migration_ptes()
      ==
      This is simple but has following problems.
      ==
       The page is uncharged and charged back again if *mapped*.
          - This means that cgroup before migration can be different from one after
            migration
          - If page is not mapped but charged as page cache, charge is just ignored
            (because not mapped, it will not be uncharged before migration)
            This is memory leak.
      ==
      This patch tries to keep memory cgroup at page migration by increasing
      one refcnt during it. 3 functions are added.
      
       mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
       mem_cgroup_end_migration()     --- decrease refcnt of page->page_cgroup
       mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
                                       new page.
      
      During migration
        - old page is under PG_locked.
        - new page is under PG_locked, too.
        - both old page and new page is not on LRU.
      
      These 3 facts guarantee that page_cgroup() migration has no race.
      
      Tested and worked well in x86_64/fake-NUMA box.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae41be37
    • B
      Memory controller: make charging gfp mask aware · e1a1cd59
      Balbir Singh 提交于
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over it's limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1a1cd59
    • B
      Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh 提交于
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a9f3ccd
  12. 06 2月, 2008 2 次提交
  13. 20 10月, 2007 3 次提交
    • G
      Typo fixes retrun -> return · e9534b3f
      Gabriel Craciunescu 提交于
      Typo fixes retrun -> return
      Signed-off-by: NGabriel Craciunescu <nix.or.die@googlemail.com>
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      e9534b3f
    • P
      Uninline find_task_by_xxx set of functions · 228ebcbe
      Pavel Emelyanov 提交于
      The find_task_by_something is a set of macros are used to find task by pid
      depending on what kind of pid is proposed - global or virtual one.  All of
      them are wrappers above the most generic one - find_task_by_pid_type_ns() -
      and just substitute some args for it.
      
      It turned out, that dereferencing the current->nsproxy->pid_ns construction
      and pushing one more argument on the stack inline cause kernel text size to
      grow.
      
      This patch moves all this stuff out-of-line into kernel/pid.c.  Together
      with the next patch it saves a bit less than 400 bytes from the .text
      section.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      228ebcbe
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
  14. 17 10月, 2007 3 次提交
  15. 15 10月, 2007 1 次提交
  16. 31 8月, 2007 1 次提交
  17. 27 7月, 2007 2 次提交
  18. 18 7月, 2007 1 次提交
    • M
      Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Mel Gorman 提交于
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickens.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickens for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      769848c0
  19. 24 4月, 2007 1 次提交
  20. 05 3月, 2007 1 次提交
  21. 08 12月, 2006 1 次提交
    • N
      [PATCH] radix-tree: RCU lockless readside · 7cf9c2c7
      Nick Piggin 提交于
      Make radix tree lookups safe to be performed without locks.  Readers are
      protected against nodes being deleted by using RCU based freeing.  Readers
      are protected against new node insertion by using memory barriers to ensure
      the node itself will be properly written before it is visible in the radix
      tree.
      
      Each radix tree node keeps a record of their height (above leaf nodes).
      This height does not change after insertion -- when the radix tree is
      extended, higher nodes are only inserted in the top.  So a lookup can take
      the pointer to what is *now* the root node, and traverse down it even if
      the tree is concurrently extended and this node becomes a subtree of a new
      root.
      
      "Direct" pointers (tree height of 0, where root->rnode points directly to
      the data item) are handled by using the low bit of the pointer to signal
      whether rnode is a direct pointer or a pointer to a radix tree node.
      
      When a reader wants to traverse the next branch, they will take a copy of
      the pointer.  This pointer will be either NULL (and the branch is empty) or
      non-NULL (and will point to a valid node).
      
      [akpm@osdl.org: cleanups]
      [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications]
      [clameter@sgi.com: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7cf9c2c7