1. 16 12月, 2009 40 次提交
    • J
      procfs: allow threads to rename siblings via /proc/pid/tasks/tid/comm · 4614a696
      john stultz 提交于
      Setting a thread's comm to be something unique is a very useful ability
      and is helpful for debugging complicated threaded applications.  However
      currently the only way to set a thread name is for the thread to name
      itself via the PR_SET_NAME prctl.
      
      However, there may be situations where it would be advantageous for a
      thread dispatcher to be naming the threads its managing, rather then
      having the threads self-describe themselves.  This sort of behavior is
      available on other systems via the pthread_setname_np() interface.
      
      This patch exports a task's comm via proc/pid/comm and
      proc/pid/task/tid/comm interfaces, and allows thread siblings to write to
      these values.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NJohn Stultz <johnstul@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Mike Fulton <fultonm@ca.ibm.com>
      Cc: Sean Foley <Sean_Foley@ca.ibm.com>
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4614a696
    • S
      procfs: use proper units for noMMU statm · 7e1e0ef2
      Steven J. Magnani 提交于
      On no-MMU systems, sizes reported in /proc/n/statm have units of bytes.
      Per Documentation/filesystems/proc.txt, these values should be in pages.
      Signed-off-by: NSteven J. Magnani <steve@digidescorp.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e1e0ef2
    • J
      nommu: fix malloc performance by adding uninitialized flag · ea637639
      Jie Zhang 提交于
      The NOMMU code currently clears all anonymous mmapped memory.  While this
      is what we want in the default case, all memory allocation from userspace
      under NOMMU has to go through this interface, including malloc() which is
      allowed to return uninitialized memory.  This can easily be a significant
      performance penalty.  So for constrained embedded systems were security is
      irrelevant, allow people to avoid clearing memory unnecessarily.
      
      This also alters the ELF-FDPIC binfmt such that it obtains uninitialised
      memory for the brk and stack region.
      Signed-off-by: NJie Zhang <jie.zhang@analog.com>
      Signed-off-by: NRobin Getz <rgetz@blackfin.uclinux.org>
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      Acked-by: NGreg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea637639
    • N
      mm hugetlb: add hugepage support to pagemap · 5dc37642
      Naoya Horiguchi 提交于
      This patch enables extraction of the pfn of a hugepage from
      /proc/pid/pagemap in an architecture independent manner.
      
      Details
      -------
      My test program (leak_pagemap) works as follows:
       - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
       - read()/write() something on it,
       - call page-types with option -p,
       - munmap() and unlink() the file on hugetlbfs
      
      Without my patches
      ------------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000086c         81        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          5        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total        101        0
      
      The output of page-types don't show any hugepage.
      
      With my patches
      ---------------
      $ ./leak_pagemap
                   flags page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000000          1        0  __________________________________
      0x0000000000030000      51100      199  ________________TG________________ compound_tail,huge
      0x0000000000028018        100        0  ___UD__________H_G________________ uptodate,dirty,compound_head,huge
      0x0000000000000804          1        0  __R________M______________________ referenced,mmap
      0x000000000000080c          1        0  __RU_______M______________________ referenced,uptodate,mmap
      0x000000000000086c         80        0  __RU_lA____M______________________ referenced,uptodate,lru,active,mmap
      0x0000000000005808          4        0  ___U_______Ma_b___________________ uptodate,mmap,anonymous,swapbacked
      0x0000000000005868         12        0  ___U_lA____Ma_b___________________ uptodate,lru,active,mmap,anonymous,swapbacked
      0x000000000000586c          1        0  __RU_lA____Ma_b___________________ referenced,uptodate,lru,active,mmap,anonymous,swapbacked
                   total      51300      200
      
      The output of page-types shows 51200 pages contributing to hugepages,
      containing 100 head pages and 51100 tail pages as expected.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5dc37642
    • N
      mm: hugetlb: fix hugepage memory leak in walk_page_range() · d33b9f45
      Naoya Horiguchi 提交于
      Most callers of pmd_none_or_clear_bad() check whether the target page is
      in a hugepage or not, but walk_page_range() do not check it.  So if we
      read /proc/pid/pagemap for the hugepage on x86 machine, the hugepage
      memory is leaked as shown below.  This patch fixes it.
      
      Details
      =======
      My test program (leak_pagemap) works as follows:
       - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
       - read()/write() something on it,
       - call page-types with option -p (walk around the page tables),
       - munmap() and unlink() the file on hugetlbfs
      
      Without my patches
      ------------------
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ./leak_pagemap
      [snip output]
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:      900
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ls /hugetlbfs/
      $
      
      100 hugepages are accounted as used while there is no file on hugetlbfs.
      
      With my patches
      ---------------
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ./leak_pagemap
      [snip output]
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ls /hugetlbfs
      $
      
      No memory leaks.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d33b9f45
    • N
      mm: hugetlb: fix hugepage memory leak in mincore() · 4f16fc10
      Naoya Horiguchi 提交于
      Most callers of pmd_none_or_clear_bad() check whether the target page is
      in a hugepage or not, but mincore() and walk_page_range() do not check it.
       So if we use mincore() on a hugepage on x86 machine, the hugepage memory
      is leaked as shown below.  This patch fixes it by extending mincore()
      system call to support hugepages.
      
      Details
      =======
      My test program (leak_mincore) works as follows:
       - creat() and mmap() a file on hugetlbfs (file size is 200MB == 100 hugepages,)
       - read()/write() something on it,
       - call mincore() for first ten pages and printf() the values of *vec
       - munmap() and unlink() the file on hugetlbfs
      
      Without my patch
      ----------------
      $ cat /proc/meminfo| grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ./leak_mincore
      vec[0] 0
      vec[1] 0
      vec[2] 0
      vec[3] 0
      vec[4] 0
      vec[5] 0
      vec[6] 0
      vec[7] 0
      vec[8] 0
      vec[9] 0
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:      999
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ls /hugetlbfs/
      $
      
      Return values in *vec from mincore() are set to 0, while the hugepage
      should be in memory, and 1 hugepage is still accounted as used while
      there is no file on hugetlbfs.
      
      With my patch
      -------------
      $ cat /proc/meminfo| grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ./leak_mincore
      vec[0] 1
      vec[1] 1
      vec[2] 1
      vec[3] 1
      vec[4] 1
      vec[5] 1
      vec[6] 1
      vec[7] 1
      vec[8] 1
      vec[9] 1
      $ cat /proc/meminfo |grep "HugePage"
      HugePages_Total:    1000
      HugePages_Free:     1000
      HugePages_Rsvd:        0
      HugePages_Surp:        0
      $ ls /hugetlbfs/
      $
      
      Return value in *vec set to 1 and no memory leaks.
      
      [akpm@linux-foundation.org: cleanup]
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f16fc10
    • M
      hugetlb: abort a hugepage pool resize if a signal is pending · 536240f2
      Mel Gorman 提交于
      If a user asks for a hugepage pool resize but specified a large number,
      the machine can begin trashing.  In response, they might hit ctrl-c but
      signals are ignored and the pool resize continues until it fails an
      allocation.  This can take a considerable amount of time so this patch
      aborts a pool resize if a signal is pending.
      
      Suggested by Dave Hansen.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      536240f2
    • L
      mlock: replace stale comments in munlock_vma_page() · 6927c1dd
      Lee Schermerhorn 提交于
      Cleanup stale comments on munlock_vma_page().
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6927c1dd
    • L
      mm: remove unevictable_migrate_page function · 418b27ef
      Lee Schermerhorn 提交于
      unevictable_migrate_page() in mm/internal.h is a relic of the since
      removed UNEVICTABLE_LRU Kconfig option.  This patch removes the function
      and open codes the test in migrate_page_copy().
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      418b27ef
    • M
      hugetlb: acquire the i_mmap_lock before walking the prio_tree to unmap a page · 4eb2b1dc
      Mel Gorman 提交于
      When the owner of a mapping fails COW because a child process is holding a
      reference, the children VMAs are walked and the page is unmapped.  The
      i_mmap_lock is taken for the unmapping of the page but not the walking of
      the prio_tree.  In theory, that tree could be changing if the lock is not
      held.  This patch takes the i_mmap_lock properly for the duration of the
      prio_tree walk.
      
      [hugh.dickins@tiscali.co.uk: Spotted the problem in the first place]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eb2b1dc
    • A
      'sysctl_max_map_count' should be non-negative · 70da2340
      Amerigo Wang 提交于
      Jan Engelhardt reported we have this problem:
      
      setting max_map_count to a value large enough results in programs dying at
      first try.  This is on 2.6.31.6:
      
      15:59 borg:/proc/sys/vm # echo $[1<<31-1] >max_map_count
      15:59 borg:/proc/sys/vm # cat max_map_count
      1073741824
      15:59 borg:/proc/sys/vm # echo $[1<<31] >max_map_count
      15:59 borg:/proc/sys/vm # cat max_map_count
      Killed
      
      This is because we have a chance to make 'max_map_count' negative.  but
      it's meaningless.  Make it only accept non-negative values.
      Reported-by: NJan Engelhardt <jengelh@medozas.de>
      Signed-off-by: NWANG Cong <amwang@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: James Morris <jmorris@namei.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70da2340
    • H
      include/linux/mm.h: remove unneeded ifdef · f096e59e
      Huang Shijie 提交于
      The check code for CONFIG_SWAP is redundant, because there is a
      non-CONFIG_SWAP version for PageSwapCache() which just returns 0.
      Signed-off-by: NHuang Shijie <shijie8@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f096e59e
    • M
      mm: uncached vma support with writenotify · c9d0bf24
      Magnus Damm 提交于
      Modify the generic mmap() code to keep the cache attribute in
      vma->vm_page_prot regardless if writenotify is enabled or not.  Without
      this patch the cache configuration selected by f_op->mmap() is overwritten
      if writenotify is enabled, making it impossible to keep the vma uncached.
      
      Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c which uses
      deferred io together with uncached memory.
      Signed-off-by: NMagnus Damm <damm@opensource.se>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jaya Kumar <jayakumar.lkml@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c9d0bf24
    • H
      vmscan: simplify code · 62c0c2f1
      Huang Shijie 提交于
      Simplify the code for shrink_inactive_list().
      Signed-off-by: NHuang Shijie <shijie8@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62c0c2f1
    • R
      vmscan: do not evict inactive pages when skipping an active list scan · b39415b2
      Rik van Riel 提交于
      In AIM7 runs, recent kernels start swapping out anonymous pages well
      before they should.  This is due to shrink_list falling through to
      shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we
      really wanted to do is pre-age some anonymous pages to give them extra
      time to be referenced while on the inactive list.
      
      The obvious fix is to make sure that shrink_list does not fall through to
      scanning/reclaiming inactive pages when we called it to scan one of the
      active lists.
      
      This change should be safe because the loop in shrink_zone ensures that we
      will still shrink the anon and file inactive lists whenever we should.
      
      [kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()]
      Reported-by: NLarry Woodman <lwoodman@redhat.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tomasz Chmielewski <mangoo@wpkg.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b39415b2
    • J
      mm/bootmem.c: properly __init-annotate helper functions · 8aa043d7
      Jan Beulich 提交于
      Signed-off-by: NJan Beulich <jbeulich@novell.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8aa043d7
    • D
      mm: slab-allocate memory section nodemask for large systems · 9ae49fab
      David Rientjes 提交于
      Nodemasks should not be allocated on the stack for large systems (when it
      is larger than 256 bytes) since there is a threat of overflow.
      
      This patch causes the unregister_mem_sect_under_nodes() nodemask to be
      allocated on the stack for smaller systems and be allocated by slab for
      larger systems.
      
      GFP_KERNEL is used since remove_memory_block() can block.
      
      Cc: Gary Hade <garyhade@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Alex Chiang <achiang@hp.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ae49fab
    • K
      mm: simplify try_to_unmap_one() · caed0f48
      KOSAKI Motohiro 提交于
      SWAP_MLOCK mean "We marked the page as PG_MLOCK, please move it to
      unevictable-lru". So, following code is easy confusable.
      
              if (vma->vm_flags & VM_LOCKED) {
                      ret = SWAP_MLOCK;
                      goto out_unmap;
              }
      
      Plus, if the VMA doesn't have VM_LOCKED, We don't need to check
      the needed of calling mlock_vma_page().
      
      Also, add some commentary to try_to_unmap_one().
      Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      caed0f48
    • R
      mm: fix section mismatch in memory_hotplug.c · 23ce932a
      Rakib Mullick 提交于
      __free_pages_bootmem() is a __meminit function - which has been called
      from put_pages_bootmem thus causes a section mismatch warning.
      
       We were warned by the following warning:
      
        LD      mm/built-in.o
      WARNING: mm/built-in.o(.text+0x26b22): Section mismatch in reference
      from the function put_page_bootmem() to the function
      .meminit.text:__free_pages_bootmem()
      The function put_page_bootmem() references
      the function __meminit __free_pages_bootmem().
      This is often because put_page_bootmem lacks a __meminit
      annotation or the annotation of __free_pages_bootmem is wrong.
      Signed-off-by: NRakib Mullick <rakib.mullick@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23ce932a
    • L
      hugetlb: prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails · b76c8cfb
      Larry Woodman 提交于
      hugetlb_fault() takes the mm->page_table_lock spinlock then calls
      hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to an
      insufficient huge page pool it calls unmap_ref_private() with the
      mm->page_table_lock held.  unmap_ref_private() then calls
      unmap_hugepage_range() which tries to acquire the mm->page_table_lock.
      
      [<ffffffff810928c3>] print_circular_bug_tail+0x80/0x9f
       [<ffffffff8109280b>] ? check_noncircular+0xb0/0xe8
       [<ffffffff810935e0>] __lock_acquire+0x956/0xc0e
       [<ffffffff81093986>] lock_acquire+0xee/0x12e
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff814c348d>] _spin_lock+0x40/0x89
       [<ffffffff8111a7a6>] ? unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111afee>] ? alloc_huge_page+0x218/0x318
       [<ffffffff8111a7a6>] unmap_hugepage_range+0x3e/0x84
       [<ffffffff8111b2d0>] hugetlb_cow+0x1e2/0x3f4
       [<ffffffff8111b935>] ? hugetlb_fault+0x453/0x4f6
       [<ffffffff8111b962>] hugetlb_fault+0x480/0x4f6
       [<ffffffff8111baee>] follow_hugetlb_page+0x116/0x2d9
       [<ffffffff814c31a7>] ? _spin_unlock_irq+0x3a/0x5c
       [<ffffffff81107b4d>] __get_user_pages+0x2a3/0x427
       [<ffffffff81107d0f>] get_user_pages+0x3e/0x54
       [<ffffffff81040b8b>] get_user_pages_fast+0x170/0x1b5
       [<ffffffff81160352>] dio_get_page+0x64/0x14a
       [<ffffffff8116112a>] __blockdev_direct_IO+0x4b7/0xb31
       [<ffffffff8115ef91>] blkdev_direct_IO+0x58/0x6e
       [<ffffffff8115e0a4>] ? blkdev_get_blocks+0x0/0xb8
       [<ffffffff810ed2c5>] generic_file_aio_read+0xdd/0x528
       [<ffffffff81219da3>] ? avc_has_perm+0x66/0x8c
       [<ffffffff81132842>] do_sync_read+0xf5/0x146
       [<ffffffff8107da00>] ? autoremove_wake_function+0x0/0x5a
       [<ffffffff81211857>] ? security_file_permission+0x24/0x3a
       [<ffffffff81132fd8>] vfs_read+0xb5/0x126
       [<ffffffff81133f6b>] ? fget_light+0x5e/0xf8
       [<ffffffff81133131>] sys_read+0x54/0x8c
       [<ffffffff81011e42>] system_call_fastpath+0x16/0x1b
      
      This can be fixed by dropping the mm->page_table_lock around the call to
      unmap_ref_private() if alloc_huge_page() fails, its dropped right below in
      the normal path anyway.  However, earlier in the that function, it's also
      possible to call into the page allocator with the same spinlock held.
      
      What this patch does is drop the spinlock before the page allocator is
      potentially entered.  The check for page allocation failure can be made
      without the page_table_lock as well as the copy of the huge page.  Even if
      the PTE changed while the spinlock was held, the consequence is that a
      huge page is copied unnecessarily.  This resolves both the double taking
      of the lock and sleeping with the spinlock held.
      
      [mel@csn.ul.ie: Cover also the case where process can sleep with spinlock]
      Signed-off-by: NLarry Woodman <lwooman@redhat.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b76c8cfb
    • A
      mm: memory_hotplug: make offline_pages() static · b4e655a4
      Andrew Morton 提交于
      It has no references outside memory_hotplug.c.
      
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4e655a4
    • H
      ksm: remove unswappable max_kernel_pages · d0f209f6
      Hugh Dickins 提交于
      Now that ksm pages are swappable, and the known holes plugged, remove
      mention of unswappable kernel pages from KSM documentation and comments.
      
      Remove the totalram_pages/4 initialization of max_kernel_pages.  In fact,
      remove max_kernel_pages altogether - we can reinstate it if removal turns
      out to break someone's script; but if we later want to limit KSM's memory
      usage, limiting the stable nodes would not be an effective approach.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0f209f6
    • H
      ksm: memory hotremove migration only · 62b61f61
      Hugh Dickins 提交于
      The previous patch enables page migration of ksm pages, but that soon gets
      into trouble: not surprising, since we're using the ksm page lock to lock
      operations on its stable_node, but page migration switches the page whose
      lock is to be used for that.  Another layer of locking would fix it, but
      do we need that yet?
      
      Do we actually need page migration of ksm pages?  Yes, memory hotremove
      needs to offline sections of memory: and since we stopped allocating ksm
      pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
      candidates for migration.
      
      But KSM is currently unconscious of NUMA issues, happily merging pages
      from different NUMA nodes: at present the rule must be, not to use
      MADV_MERGEABLE where you care about NUMA.  So no, NUMA page migration of
      ksm pages does not make sense yet.
      
      So, to complete support for ksm swapping we need to make hotremove safe.
      ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
      release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE.  But if mapped pages
      are freed before migration reaches them, stable_nodes may be left still
      pointing to struct pages which have been removed from the system: the
      stable_node needs to identify a page by pfn rather than page pointer, then
      it can safely prune them when MEM_OFFLINE.
      
      And make NUMA migration skip PageKsm pages where it skips PageReserved.
      But it's only when we reach unmap_and_move() that the page lock is taken
      and we can be sure that raised pagecount has prevented a PageAnon from
      being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
      page when offlining (has sufficient locking) but reject it otherwise.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62b61f61
    • H
      ksm: rmap_walk to remove_migation_ptes · e9995ef9
      Hugh Dickins 提交于
      A side-effect of making ksm pages swappable is that they have to be placed
      on the LRUs: which then exposes them to isolate_lru_page() and hence to
      page migration.
      
      Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
      rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c.  Perhaps some
      consolidation with existing code is possible, but don't attempt that yet
      (try_to_unmap needs to handle nonlinears, but migration pte removal does
      not).
      
      rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
      remove_anon_migration_ptes() which it replaces, avoids calling
      page_lock_anon_vma(), because that includes a page_mapped() test which
      fails when all migration ptes are in place.  That was valid when NUMA page
      migration was introduced (holding mmap_sem provided the missing guarantee
      that anon_vma's slab had not already been destroyed), but I believe not
      valid in the memory hotremove case added since.
      
      For now do the same as before, and consider the best way to fix that
      unlikely race later on.  When fixed, we can probably use rmap_walk() on
      hwpoisoned ksm pages too: for now, they remain among hwpoison's various
      exceptions (its PageKsm test comes before the page is locked, but its
      page_lock_anon_vma fails safely if an anon gets upgraded).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9995ef9
    • H
      ksm: mem cgroup charge swapin copy · 407f9c8b
      Hugh Dickins 提交于
      But ksm swapping does require one small change in mem cgroup handling.
      When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
      substitute a duplicate page to accommodate a different anon_vma (or a the
      !PageSwapCache check in mem_cgroup_try_charge_swapin().
      
      That was returning success without charging, on the assumption that
      pte_same() would fail after, which is not the case here.  Originally I
      proposed that success, so that an unshrinkable mem cgroup at its limit
      would not fail unnecessarily; but that's a minor point, and there are
      plenty of other places where we may fail an overallocation which might
      later prove unnecessary.  So just go ahead and do what all the other
      exceptions do: proceed to charge current mm.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      407f9c8b
    • H
      ksm: share anon page without allocating · 80e14822
      Hugh Dickins 提交于
      When ksm pages were unswappable, it made no sense to include them in mem
      cgroup accounting; but now that they are swappable (although I see no
      strict logical connection) the principle of least surprise implies that
      they should be accounted (with the usual dissatisfaction, that a shared
      page is accounted to only one of the cgroups using it).
      
      This patch was intended to add mem cgroup accounting where necessary; but
      turned inside out, it now avoids allocating a ksm page, instead upgrading
      an anon page to ksm - which brings its existing mem cgroup accounting with
      it.  Thus mem cgroups don't appear in the patch at all.
      
      This upgrade from PageAnon to PageKsm takes place under page lock (via a
      somewhat hacky NULL kpage interface), and audit showed only one place
      which needed to cope with the race - page_referenced() is sometimes used
      without page lock, so page_lock_anon_vma() needs an ACCESS_ONCE() to be
      sure of getting anon_vma and flags together (no problem if the page goes
      ksm an instant after, the integrity of that anon_vma list is unaffected).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80e14822
    • H
      ksm: take keyhole reference to page · 4035c07a
      Hugh Dickins 提交于
      There's a lamentable flaw in KSM swapping: the stable_node holds a
      reference to the ksm page, so the page to be freed cannot actually be
      freed until ksmd works its way around to removing the last rmap_item from
      its stable_node.  Which in some configurations may take minutes: not quite
      responsive enough for memory reclaim.  And we don't want to twist KSM and
      its locking more tightly into the rest of mm.  What a pity.
      
      But although the stable_node needs to hold a pointer to the ksm page, does
      it actually need to raise the reference count of that page?
      
      No.  It would need to do so if struct pages were ordinary kmalloc'ed
      objects; but they are more stable than that, and reused in particular ways
      according to particular rules.
      
      Access to stable_node from its pointer in struct page is no problem, so
      long as we never free a stable_node before the ksm page itself has been
      freed.  Access to struct page from its pointer in stable_node: reintroduce
      get_ksm_page(), and let that peep out through its keyhole (the stable_node
      pointer to ksm page), to see if that struct page still holds the right key
      to open it (the ksm page mapping pointer back to this stable_node).
      
      This relies upon the established way in which free_hot_cold_page() sets an
      anon (including ksm) page->mapping to NULL; and relies upon no other user
      of a struct page to put something which looks like the original
      stable_node pointer (with two low bits also set) into page->mapping.  It
      also needs get_page_unless_zero() technique pioneered by speculative
      pagecache; and uses rcu_read_lock() to keep the guarantees that gives.
      
      There are several drivers which put pointers of their own into page->
      mapping; but none of those could coincide with our stable_node pointers,
      since KSM won't free a stable_node until it sees that the page has gone.
      
      The only problem case found is the pagetable spinlock USE_SPLIT_PTLOCKS
      places in struct page (my own abuse): to accommodate GENERIC_LOCKBREAK's
      break_lock on 32-bit, that spans both page->private and page->mapping.
      Since break_lock is only 0 or 1, again no confusion for get_ksm_page().
      
      But what of DEBUG_SPINLOCK on 64-bit bigendian?  When owner_cpu is 3
      (matching PageKsm low bits), it might see 0xdead4ead00000003 in page->
      mapping, which might coincide?  We could get around that by...  but a
      better answer is to suppress USE_SPLIT_PTLOCKS when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC, to stop bloating sizeof(struct page) in their case -
      already proposed in an earlier mm/Kconfig patch.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4035c07a
    • H
      ksm: hold anon_vma in rmap_item · db114b83
      Hugh Dickins 提交于
      For full functionality, page_referenced_one() and try_to_unmap_one() need
      to know the vma: to pass vma down to arch-dependent flushes, or to observe
      VM_LOCKED or VM_EXEC.  But KSM keeps no record of vma: nor can it, since
      vmas get split and merged without its knowledge.
      
      Instead, note page's anon_vma in its rmap_item when adding to stable tree:
      all the vmas which might map that page are listed by its anon_vma.
      
      page_referenced_ksm() and try_to_unmap_ksm() then traverse the anon_vma,
      first to find the probable vma, that which matches rmap_item's mm; but if
      that is not enough to locate all instances, traverse again to try the
      others.  This catches those occasions when fork has duplicated a pte of a
      ksm page, but ksmd has not yet come around to assign it an rmap_item.
      
      But each rmap_item in the stable tree which refers to an anon_vma needs to
      take a reference to it.  Andrea's anon_vma design cleverly avoided a
      reference count (an anon_vma was free when its list of vmas was empty),
      but KSM now needs to add that.  Is a 32-bit count sufficient?  I believe
      so - the anon_vma is only free when both count is 0 and list is empty.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db114b83
    • H
      ksm: let shared pages be swappable · 5ad64688
      Hugh Dickins 提交于
      Initial implementation for swapping out KSM's shared pages: add
      page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
      faced with a PageKsm page.
      
      Most of what's needed can be got from the rmap_items listed from the
      stable_node of the ksm page, without discovering the actual vma: so in
      this patch just fake up a struct vma for page_referenced_one() or
      try_to_unmap_one(), then refine that in the next patch.
      
      Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
      implicit there (being only set with VM_SHARED, already excluded), but
      let's make it explicit, to help justify the lack of nonlinear unmap.
      
      Rely on the page lock to protect against concurrent modifications to that
      page's node of the stable tree.
      
      The awkward part is not swapout but swapin: do_swap_page() and
      page_add_anon_rmap() now have to allow for new possibilities - perhaps a
      ksm page still in swapcache, perhaps a swapcache page associated with one
      location in one anon_vma now needed for another location or anon_vma.
      (And the vma might even be no longer VM_MERGEABLE when that happens.)
      
      ksm_might_need_to_copy() checks for that case, and supplies a duplicate
      page when necessary, simply leaving it to a subsequent pass of ksmd to
      rediscover the identity and merge them back into one ksm page.
      Disappointingly primitive: but the alternative would have to accumulate
      unswappable info about the swapped out ksm pages, limiting swappability.
      
      Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
      particular case it was handling, so just use it instead.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ad64688
    • H
      ksm: fix mlockfreed to munlocked · 73848b46
      Hugh Dickins 提交于
      When KSM merges an mlocked page, it has been forgetting to munlock it:
      that's been left to free_page_mlock(), which reports it in /proc/vmstat as
      unevictable_pgs_mlockfreed instead of unevictable_pgs_munlocked (and
      whinges "Page flag mlocked set for process" in mmotm, whereas mainline is
      silently forgiving).  Call munlock_vma_page() to fix that.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73848b46
    • H
      ksm: stable_node point to page and back · 08beca44
      Hugh Dickins 提交于
      Add a pointer to the ksm page into struct stable_node, holding a reference
      to the page while the node exists.  Put a pointer to the stable_node into
      the ksm page's ->mapping.
      
      Then we don't need get_ksm_page() while traversing the stable tree: the
      page to compare against is sure to be present and correct, even if it's no
      longer visible through any of its existing rmap_items.
      
      And we can handle the forked ksm page case more efficiently: no need to
      memcmp our way through the tree to find its match.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08beca44
    • H
      ksm: separate stable_node · 7b6ba2c7
      Hugh Dickins 提交于
      Though we still do well to keep rmap_items in the unstable tree without a
      separate tree_item at the node, for several reasons it becomes awkward to
      keep rmap_items in the stable tree without a separate stable_node: lack of
      space in the nicely-sized rmap_item, the need for an anchor as rmap_items
      are removed, the need for a node even when temporarily no rmap_items are
      attached to it.
      
      So declare struct stable_node (rb_node to place it in the tree and
      hlist_head for the rmap_items hanging off it), and convert stable tree
      handling to use it: without yet taking advantage of it.  Note how one
      stable_tree_insert() of a node now has _two_ stable_tree_append()s of the
      two rmap_items being merged.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b6ba2c7
    • H
      ksm: singly-linked rmap_list · 6514d511
      Hugh Dickins 提交于
      Free up a pointer in struct rmap_item, by making the mm_slot's rmap_list a
      singly-linked list: we always traverse that list sequentially, and we
      don't even lose any prefetches (but should consider adding a few later).
      Name it rmap_list throughout.
      
      Do we need to free up that pointer?  Not immediately, and in the end, we
      could continue to avoid it with a union; but having done the conversion,
      let's keep it this way, since there's no downside, and maybe we'll want
      more in future (struct rmap_item is a cache-friendly 32 bytes on 32-bit
      and 64 bytes on 64-bit, so we shall want to avoid expanding it).
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6514d511
    • H
      ksm: cleanup some function arguments · 8dd3557a
      Hugh Dickins 提交于
      Cleanup: make argument names more consistent from cmp_and_merge_page()
      down to replace_page(), so that it's easier to follow the rmap_item's page
      and the matching tree_page and the merged kpage through that code.
      
      In some places, e.g.  break_cow(), pass rmap_item instead of separate mm
      and address.
      
      cmp_and_merge_page() initialize tree_page to NULL, to avoid a "may be used
      uninitialized" warning seen in one config by Anil SB.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dd3557a
    • H
      ksm: remove redundancies when merging page · 31e855ea
      Hugh Dickins 提交于
      There is no need for replace_page() to calculate a write-protected prot
      vm_page_prot must already be write-protected for an anonymous page (see
      mm/memory.c do_anonymous_page() for similar reliance on vm_page_prot).
      
      There is no need for try_to_merge_one_page() to get_page and put_page on
      newpage and oldpage: in every case we already hold a reference to each of
      them.
      
      But some instinct makes me move try_to_merge_one_page()'s unlock_page of
      oldpage down after replace_page(): that doesn't increase contention on the
      ksm page, and makes thinking about the transition easier.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31e855ea
    • H
      ksm: three remove_rmap_item_from_tree cleanups · 93d17715
      Hugh Dickins 提交于
      1. remove_rmap_item_from_tree() is called as a precaution from
         various places: don't dirty the rmap_item cacheline unnecessarily,
         just mask the flags out of the address when they have been set.
      
      2. First get_next_rmap_item() removes an unstable rmap_item from its tree,
         then shortly afterwards cmp_and_merge_page() removes a stable rmap_item
         from its tree: it's easier just to do both at once (but definitely keep
         the BUG_ON(age > 1) which guards against a future omission).
      
      3. When cmp_and_merge_page() moves an rmap_item from unstable to stable
         tree, it does its own rb_erase() and accounting: that's better
         expressed by remove_rmap_item_from_tree().
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93d17715
    • K
      vmscan: make consistent of reclaim bale out between do_try_to_free_page and shrink_zone · 338fde90
      KOSAKI Motohiro 提交于
      Fix small inconsistent of ">" and ">=".
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      338fde90
    • K
      vmscan: kill sc.swap_cluster_max · ece74b2e
      KOSAKI Motohiro 提交于
      Now, All caller of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX.
      Then, we can remove it perfectly.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ece74b2e
    • K
      vmscan: zone_reclaim() don't use insane swap_cluster_max · 4f0ddfdf
      KOSAKI Motohiro 提交于
      In old days, we didn't have sc.nr_to_reclaim and it brought
      sc.swap_cluster_max misuse.
      
      huge sc.swap_cluster_max might makes unnecessary OOM risk and no
      performance benefit.
      
      Now, we can stop its insane thing.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f0ddfdf
    • K
      vmscan: kill hibernation specific reclaim logic and unify it · 7b51755c
      KOSAKI Motohiro 提交于
      shrink_all_zone() was introduced by commit d6277db4 (swsusp: rework
      memory shrinker) for hibernate performance improvement.  and
      sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
      memory for suspend).
      
      commit a06fe4d307 said
      
         Without the patch:
         Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
         Freed  88563 pages in 14719 jiffies = 23.50 MB/s
         Freed 205734 pages in 32389 jiffies = 24.81 MB/s
      
         With the patch:
         Freed  68252 pages in   496 jiffies = 537.52 MB/s
         Freed 116464 pages in   569 jiffies = 798.54 MB/s
         Freed 209699 pages in   705 jiffies = 1161.89 MB/s
      
      At that time, their patch was pretty worth.  However, Modern Hardware
      trend and recent VM improvement broke its worth.  From several reason, I
      think we should remove shrink_all_zones() at all.
      
      detail:
      
      1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
        at no i/o congestion.
        but current shrink_zone() is sane, not slow.
      
      2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
        fine on numa system.
        example)
          System has 4GB memory and each node have 2GB. and hibernate need 1GB.
      
          optimal)
             steal 500MB from each node.
          shrink_all_zones)
             steal 1GB from node-0.
      
        Oh, Cache balancing logic was broken. ;)
        Unfortunately, Desktop system moved ahead NUMA at nowadays.
        (Side note, if hibernate require 2GB, shrink_all_zones() never success
         on above machine)
      
      3) if the node has several I/O flighting pages, shrink_all_zones() makes
        pretty bad result.
      
        schenario) hibernate need 1GB
      
        1) shrink_all_zones() try to reclaim 1GB from Node-0
        2) but it only reclaimed 990MB
        3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
        4) it reclaimed 990MB
      
        Oh, well. it reclaimed twice much than required.
        In the other hand, current shrink_zone() has sane baling out logic.
        then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
      
      4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
        shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
      
      Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
      it bring good reviewability and debuggability, and solve above problems.
      
      side note: Reclaim logic unificication makes two good side effect.
       - Fix recursive reclaim bug on shrink_all_memory().
         it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
       - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b51755c