1. 30 5月, 2012 40 次提交
    • B
      mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks · 5ceb9ce6
      Bartlomiej Zolnierkiewicz 提交于
      When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type
      pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an
      allocation takes ownership of the block may take too long.  The type of
      the pageblock remains unchanged so the pageblock cannot be used as a
      migration target during compaction.
      
      Fix it by:
      
      * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and
        COMPACT_SYNC) and then converting sync field in struct compact_control
        to use it.
      
      * Adding nr_pageblocks_skipped field to struct compact_control and
        tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type.
         If COMPACT_ASYNC_MOVABLE mode compaction ran fully in
        try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a
        suitable page for allocation.  In this case then check how if there were
        enough MIGRATE_UNMOVABLE pageblocks to try a second pass in
        COMPACT_ASYNC_UNMOVABLE mode.
      
      * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and
        COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on
        finding PageBuddy pages, page_count(page) == 0 or PageLRU pages.  If all
        pages within the MIGRATE_UNMOVABLE pageblock are in one of those three
        sets change the whole pageblock type to MIGRATE_MOVABLE.
      
      My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means
      131072 standard 4KiB pages in 'Normal' zone) is to:
      
      - allocate 120000 pages for kernel's usage
      - free every second page (60000 pages) of memory just allocated
      - allocate and use 60000 pages from user space
      - free remaining 60000 pages of kernel memory
        (now we have fragmented memory occupied mostly by user space pages)
      - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage
      
      The results:
      - with compaction disabled I get 11 successful allocations
      - with compaction enabled - 14 successful allocations
      - with this patch I'm able to get all 100 successful allocations
      
      NOTE: If we can make kswapd aware of order-0 request during compaction, we
      can enhance kswapd with changing mode to COMPACT_ASYNC_FULL
      (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE).  Please see the
      following thread:
      
      	http://marc.info/?l=linux-mm&m=133552069417068&w=2
      
      [minchan@kernel.org: minor cleanups]
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ceb9ce6
    • J
      mm: remove sparsemem allocation details from the bootmem allocator · 238305bb
      Johannes Weiner 提交于
      alloc_bootmem_section() derives allocation area constraints from the
      specified sparsemem section.  This is a bit specific for a generic memory
      allocator like bootmem, though, so move it over to sparsemem.
      
      As __alloc_bootmem_node_nopanic() already retries failed allocations with
      relaxed area constraints, the fallback code in sparsemem.c can be removed
      and the code becomes a bit more compact overall.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      238305bb
    • J
      mm: bootmem: pass pgdat instead of pgdat->bdata down the stack · e9079911
      Johannes Weiner 提交于
      Pass down the node descriptor instead of the more specific bootmem node
      descriptor down the call stack, like nobootmem does, when there is no good
      reason for the two to be different.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9079911
    • J
      mm: nobootmem: unify allocation policy of (non-)panicking node allocations · ba539868
      Johannes Weiner 提交于
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the non-panicking
      one, like the node-agnostic interface, so they always behave the same way
      apart from how to deal with allocation failure.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba539868
    • J
      mm: nobootmem: panic on node-specific allocation failure · 2c478eae
      Johannes Weiner 提交于
      __alloc_bootmem_node and __alloc_bootmem_low_node documentation claims
      the functions panic on allocation failure.  Do it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c478eae
    • J
      mm: bootmem: unify allocation policy of (non-)panicking node allocations · 421456ed
      Johannes Weiner 提交于
      While the panicking node-specific allocation function tries to satisfy
      node+goal, goal, node, anywhere, the non-panicking function still does
      node+goal, goal, anywhere.
      
      Make it simpler: define the panicking version in terms of the
      non-panicking one, like the node-agnostic interface, so they always behave
      the same way apart from how to deal with allocation failure.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      421456ed
    • J
      mm: bootmem: allocate in order node+goal, goal, node, anywhere · ab381843
      Johannes Weiner 提交于
      Match the nobootmem version of __alloc_bootmem_node.  Try to satisfy both
      the node and the goal, then just the goal, then just the node, then
      allocate anywhere before panicking.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab381843
    • J
      mm: bootmem: split out goal-to-node mapping from goal dropping · c12ab504
      Johannes Weiner 提交于
      Matching the desired goal to the right node is one thing, dropping the
      goal when it can not be satisfied is another.  Split this into separate
      functions so that subsequent patches can use the node-finding but drop and
      handle the goal fallback on their own terms.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c12ab504
    • J
      mm: bootmem: rename alloc_bootmem_core to alloc_bootmem_bdata · c6785b6b
      Johannes Weiner 提交于
      Callsites need to provide a bootmem_data_t *, make the naming more
      descriptive.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Gavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6785b6b
    • J
      mm: bootmem: remove redundant offset check when finally freeing bootmem · 549381e1
      Johannes Weiner 提交于
      When bootmem releases an unaligned BITS_PER_LONG pages chunk of memory
      to the page allocator, it checks the bitmap if there are still
      unreserved pages in the chunk (set bits), but also if the offset in the
      chunk indicates BITS_PER_LONG loop iterations already.
      
      But since the consulted bitmap is only a one-word-excerpt of the full
      per-node bitmap, there can not be more than BITS_PER_LONG bits set in
      it.  The additional offset check is unnecessary.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      549381e1
    • G
      mm: bootmem: fix checking the bitmap when finally freeing bootmem · 6dccdcbe
      Gavin Shan 提交于
      When bootmem releases an unaligned chunk of memory at the beginning of a
      node to the page allocator, it iterates from that unaligned PFN but
      checks an aligned word of the page bitmap.  The checked bits do not
      correspond to the PFNs and, as a result, reserved pages can be freed.
      
      Properly shift the bitmap word so that the lowest bit corresponds to the
      starting PFN before entering the freeing loop.
      
      This bug has been around since commit 41546c17 ("bootmem: clean up
      free_all_bootmem_core") (2.6.27) without known reports.
      Signed-off-by: NGavin Shan <shangw@linux.vnet.ibm.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dccdcbe
    • A
      mm/page_alloc.c: remove pageblock_default_order() · 955c1cd7
      Andrew Morton 提交于
      This has always been broken: one version takes an unsigned int and the
      other version takes no arguments.  This bug was hidden because one
      version of set_pageblock_order() was a macro which doesn't evaluate its
      argument.
      
      Simplify it all and remove pageblock_default_order() altogether.
      Reported-by: Nrajman mekaco <rajman.mekaco@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      955c1cd7
    • A
      mm: move is_vma_temporary_stack() declaration to huge_mm.h · 20995974
      Alex Shi 提交于
      When transparent_hugepage_enabled() is used outside mm/, such as in
      arch/x86/xx/tlb.c:
      
      +       if (!cpu_has_invlpg || vma->vm_flags & VM_HUGETLB
      +                       || transparent_hugepage_enabled(vma)) {
      +               flush_tlb_mm(vma->vm_mm);
      
      is_vma_temporary_stack() isn't referenced in huge_mm.h, so it has compile
      errors:
      
        arch/x86/mm/tlb.c: In function `flush_tlb_range':
        arch/x86/mm/tlb.c:324:4: error: implicit declaration of function `is_vma_temporary_stack' [-Werror=implicit-function-declaration]
      
      Since is_vma_temporay_stack() is just used in rmap.c and huge_memory.c, it
      is better to move it to huge_mm.h from rmap.h to avoid such errors.
      Signed-off-by: NAlex Shi <alex.shi@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20995974
    • U
      tools/vm/page-types.c: cleanups · e30d539b
      Ulrich Drepper 提交于
      Compiling page-type.c with a recent compiler produces many warnings,
      mostly related to signed/unsigned comparisons.  This patch cleans up most
      of them.
      
      One remaining warning is about an unused parameter.  The <compiler.h> file
      doesn't define a __unused macro (or the like) yet.  This can be addressed
      later.
      Signed-off-by: NUlrich Drepper <drepper@gmail.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e30d539b
    • U
      kbuild: install kernel-page-flags.h · 9295b7a0
      Ulrich Drepper 提交于
      Programs using /proc/kpageflags need to know about the various flags.  The
      <linux/kernel-page-flags.h> provides them and the comments in the file
      indicate that it is supposed to be used by user-level code.  But the file
      is not installed.
      
      Install the headers and mark the unstable flags as out-of-bounds.  The
      page-type tool is also adjusted to not duplicate the definitions
      Signed-off-by: NUlrich Drepper <drepper@gmail.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9295b7a0
    • B
      mm: print physical addresses consistently with other parts of kernel · a62e2f4f
      Bjorn Helgaas 提交于
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -Zone PFN ranges:
          +Zone ranges:
          -  DMA32    0x00000010 -> 0x00100000
          +  DMA32    [mem 0x00010000-0xffffffff]
          -  Normal   0x00100000 -> 0x01080000
          +  Normal   [mem 0x100000000-0x107fffffff]
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a62e2f4f
    • B
      swiotlb: print physical addresses consistently with other parts of kernel · 3af684c7
      Bjorn Helgaas 提交于
      Print swiotlb info in a style consistent with the %pR style used elsewhere
      in the kernel.  For example:
      
          -Placing 64MB software IO TLB between ffff88007a662000 - ffff88007e662000
          -software IO TLB at phys 0x7a662000 - 0x7e662000
          +software IO TLB [mem 0x7a662000-0x7e661fff] (64MB) mapped at [ffff88007a662000-ffff88007e661fff]
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3af684c7
    • B
      x86: print physical addresses consistently with other parts of kernel · 365811d6
      Bjorn Helgaas 提交于
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -found SMP MP-table at [ffff8800000fce90] fce90
          +found SMP MP-table at [mem 0x000fce90-0x000fce9f] mapped at [ffff8800000fce90]
          -initial memory mapped : 0 - 20000000
          +initial memory mapped: [mem 0x00000000-0x1fffffff]
          -Base memory trampoline at [ffff88000009c000] 9c000 size 8192
          +Base memory trampoline [mem 0x0009c000-0x0009dfff] mapped at [ffff88000009c000]
          -SRAT: Node 0 PXM 0 0-80000000
          +SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      365811d6
    • B
      x86: print e820 physical addresses consistently with other parts of kernel · 91eb0f67
      Bjorn Helgaas 提交于
      Print physical address info in a style consistent with the %pR style used
      elsewhere in the kernel.  For example:
      
          -BIOS-provided physical RAM map:
          +e820: BIOS-provided physical RAM map:
          - BIOS-e820: 0000000000000100 - 000000000009e000 (usable)
          +BIOS-e820: [mem 0x0000000000000100-0x000000000009dfff] usable
          -Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
          +e820: [mem 0x90000000-0xfed1bfff] available for PCI devices
          -reserve RAM buffer: 000000000009e000 - 000000000009ffff
          +e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91eb0f67
    • K
      bug: completely remove code generated by disabled VM_BUG_ON() · 02602a18
      Konstantin Khlebnikov 提交于
      Even if CONFIG_DEBUG_VM=n gcc genereates code for some VM_BUG_ON()
      
      for example VM_BUG_ON(!PageCompound(page) || !PageHead(page)); in
      do_huge_pmd_wp_page() generates 114 bytes of code.
      
      But they mostly disappears when I split this VM_BUG_ON into two:
      
        -VM_BUG_ON(!PageCompound(page) || !PageHead(page));
        +VM_BUG_ON(!PageCompound(page));
        +VM_BUG_ON(!PageHead(page));
      
      weird... but anyway after this patch code disappears completely.
      
        add/remove: 0/0 grow/shrink: 7/97 up/down: 135/-1784 (-1649)
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02602a18
    • K
      bug: introduce BUILD_BUG_ON_INVALID() macro · baf05aa9
      Konstantin Khlebnikov 提交于
      Sometimes we want to check some expressions correctness at compile time.
      "(void)(e);" or "if (e);" can be dangerous if the expression has
      side-effects, and gcc sometimes generates a lot of code, even if the
      expression has no effect.
      
      This patch introduces macro BUILD_BUG_ON_INVALID() for such checks, it
      forces a compilation error if expression is invalid without any extra
      code.
      
      [Cast to "long" required because sizeof does not work for bit-fields.]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      baf05aa9
    • C
      Cross Memory Attach: make it Kconfigurable · 5febcbe9
      Christopher Yeoh 提交于
      Add a Kconfig option to allow people who don't want cross memory attach to
      not have it included in their build.
      Signed-off-by: NChris Yeoh <yeohc@au1.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5febcbe9
    • J
      Documentation: memcg: future proof hierarchical statistics documentation · eb6332a5
      Johannes Weiner 提交于
      The hierarchical versions of per-memcg counters in memory.stat are all
      calculated the same way and are all named total_<counter>.
      
      Documenting the pattern is easier for maintenance than listing each
      counter twice.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NYing Han <yinghan@google.com>
      Randy Dunlap <rdunlap@xenotime.net>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb6332a5
    • D
      mm, thp: drop page_table_lock to uncharge memcg pages · 6f60b69d
      David Rientjes 提交于
      mm->page_table_lock is hotly contested for page fault tests and isn't
      necessary to do mem_cgroup_uncharge_page() in do_huge_pmd_wp_page().
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f60b69d
    • Y
      mm: rename is_mlocked_vma() to mlocked_vma_newpage() · 096a7cf4
      Ying Han 提交于
      Andrew pointed out that the is_mlocked_vma() is misnamed.  A function
      with name like that would expect bool return and no side-effects.
      
      Since it is called on the fault path for new page, rename it in this
      patch.
      Signed-off-by: NYing Han <yinghan@google.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      [akpm@linux-foundation.org: s/mlock_vma_newpage/mlock_vma_newpage/, per Minchan]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      096a7cf4
    • J
      mm: memcg: count pte references from every member of the reclaimed hierarchy · c3ac9a8a
      Johannes Weiner 提交于
      The rmap walker checking page table references has historically ignored
      references from VMAs that were not part of the memcg that was being
      reclaimed during memcg hard limit reclaim.
      
      When transitioning global reclaim to memcg hierarchy reclaim, I missed
      that bit and now references from outside a memcg are ignored even during
      global reclaim.
      
      Reverting back to traditional behaviour - count all references during
      global reclaim and only mind references of the memcg being reclaimed
      during limit reclaim would be one option.
      
      However, the more generic idea is to ignore references exactly then when
      they are outside the hierarchy that is currently under reclaim; because
      only then will their reclamation be of any use to help the pressure
      situation.  It makes no sense to ignore references from a sibling memcg
      and then evict a page that will be immediately refaulted by that sibling
      which contributes to the same usage of the common ancestor under
      reclaim.
      
      The solution: make the rmap walker ignore references from VMAs that are
      not part of the hierarchy that is being reclaimed.
      
      Flat limit reclaim will stay the same, hierarchical limit reclaim will
      mind the references only to pages that the hierarchy owns.  Global
      reclaim, since it reclaims from all memcgs, will be fixed to regard all
      references.
      
      [akpm@linux-foundation.org: name the args in the declaration]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Konstantin Khlebnikov<khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3ac9a8a
    • J
      kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite · 91c63734
      Johannes Weiner 提交于
      Library functions should not grab locks when the callsites can do it,
      even if the lock nests like the rcu read-side lock does.
      
      Push the rcu_read_lock() from css_is_ancestor() to its single user,
      mem_cgroup_same_or_subtree() in preparation for another user that may
      already hold the rcu read-side lock.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91c63734
    • A
      mm: do_migrate_pages(): rename arguments · 0ce72d4f
      Andrew Morton 提交于
      s/from_nodes/from and s/to_nodes/to/.  The "_nodes" is redundant - it
      duplicates the argument's type.
      
      Done in a fit of irritation over 80-col issues :(
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <mkosaki@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ce72d4f
    • L
      mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct node · 4a5b18cc
      Larry Woodman 提交于
      While running an application that moves tasks from one cpuset to another
      I noticed that it takes much longer and moves many more pages than
      expected.
      
      The reason for this is do_migrate_pages() does its best to preserve the
      relative node differential from the first node of the cpuset because the
      application may have been written with that in mind.  If memory was
      interleaved on the nodes of the source cpuset by an application
      do_migrate_pages() will try its best to maintain that interleaving on
      the nodes of the destination cpuset.  This means copying the memory from
      all source nodes to the destination nodes even if the source and
      destination nodes overlap.
      
      This is a problem for userspace NUMA placement tools.  The amount of
      time spent doing extra memory moves cancels out some of the NUMA
      performance improvements.  Furthermore, if the number of source and
      destination nodes are to maintain the previous interleaving layout
      anyway.
      
      This patch changes do_migrate_pages() to only preserve the relative
      layout inside the program if the number of NUMA nodes in the source and
      destination mask are the same.  If the number is different, we do a much
      more efficient migration by not touching memory that is in an allowed
      node.
      
      This preserves the old behaviour for programs that want it, while
      allowing a userspace NUMA placement tool to use the new, faster
      migration.  This improves performance in our tests by up to a factor of
      7.
      
      Without this change migrating tasks from a cpuset containing nodes 0-7
      to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if
      they are in the both the source and destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 4 to 3
         Migrating 1 to 4
         Migrating 3 to 4
         Migrating 0 to 3
         Migrating 2 to 3
      
      With this change we only migrate from nodes that are not in the
      destination nodesets:
      
         Migrating 7 to 4
         Migrating 6 to 3
         Migrating 5 to 4
         Migrating 2 to 3
         Migrating 1 to 4
         Migrating 0 to 3
      
      Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset
      containing 3,4,5 we still do move everything so that we preserve the
      desired NUMA offsets:
      
         Migrating 4 to 5
         Migrating 3 to 4
         Migrating 2 to 3
      
      As far as performance is concerned this simple patch improves the time
      it takes to move 14, 20 and 26 large tasks from a cpuset containing
      nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7.
      Here are the timings with and without the patch:
      
      BEFORE PATCH -- Move times: 59, 140, 651 seconds
      ============
      
        Moving 14 tasks from nodes (0-7) to nodes (1,3)
        numad(8780) do_migrate_pages (mm=0xffff88081d414400
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 8890 moved to node(s) 1,3 in 59.2 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
        numad(8780) do_migrate_pages (mm=0xffff88081d88c700
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 8962 moved to node(s) 1,4-5 in 139.88 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
        numad(8780) do_migrate_pages (mm=0xffff88081d5bc740
        from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
        numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 9058 moved to node(s) 1-3,5 in 651.45 seconds
      
      AFTER PATCH -- Move times: 42, 56, 93 seconds
      ===========
      
        Moving 14 tasks from nodes (0-7) to nodes (5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d5ff140
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
        (Above moves repeated for each of the 14 tasks...)
        PID 33221 moved to node(s) 5,7 in 41.67 seconds
      
        Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
        numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 20 tasks...)
        PID 33289 moved to node(s) 1,3,5 in 56.3 seconds
      
        Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
        numad(33209) do_migrate_pages (mm=0xffff88101d924400
        from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
        numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
        (Above moves repeated for each of the 26 tasks...)
        PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds
      
      [akpm@linux-foundation.org: clean up comment layout]
      Signed-off-by: NLarry Woodman <lwoodman@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a5b18cc
    • D
      thp, memcg: split hugepage for memcg oom on cow · 1f1d06c3
      David Rientjes 提交于
      On COW, a new hugepage is allocated and charged to the memcg.  If the
      system is oom or the charge to the memcg fails, however, the fault
      handler will return VM_FAULT_OOM which results in an oom kill.
      
      Instead, it's possible to fallback to splitting the hugepage so that the
      COW results only in an order-0 page being allocated and charged to the
      memcg which has a higher liklihood to succeed.  This is expensive
      because the hugepage must be split in the page fault handler, but it is
      much better than unnecessarily oom killing a process.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f1d06c3
    • S
      mm/vmstat.c: remove debug fs entries on failure of file creation and made... · bde8bd8a
      Sasikantha babu 提交于
      mm/vmstat.c: remove debug fs entries on failure of file creation and made extfrag_debug_root dentry local
      
      Remove debug fs files and directory on failure.  Since no one is using
      "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it
      local to the function.
      Signed-off-by: NSasikantha babu <sasikanth.v19@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bde8bd8a
    • S
      mm/fork: fix overflow in vma length when copying mmap on clone · 7edc8b0a
      Siddhesh Poyarekar 提交于
      The vma length in dup_mmap is calculated and stored in a unsigned int,
      which is insufficient and hence overflows for very large maps (beyond
      16TB). The following program demonstrates this:
      
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/mman.h>
      
      #define GIG 1024 * 1024 * 1024L
      #define EXTENT 16393
      
      int main(void)
      {
              int i, r;
              void *m;
              char buf[1024];
      
              for (i = 0; i < EXTENT; i++) {
                      m = mmap(NULL, (size_t) 1 * 1024 * 1024 * 1024L,
                               PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      
                      if (m == (void *)-1)
                              printf("MMAP Failed: %d\n", m);
                      else
                              printf("%d : MMAP returned %p\n", i, m);
      
                      r = fork();
      
                      if (r == 0) {
                              printf("%d: successed\n", i);
                              return 0;
                      } else if (r < 0)
                              printf("FORK Failed: %d\n", r);
                      else if (r > 0)
                              wait(NULL);
              }
              return 0;
      }
      
      Increase the storage size of the result to unsigned long, which is
      sufficient for storing the difference between addresses.
      Signed-off-by: NSiddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7edc8b0a
    • R
      mm/mmap.c: find_vma(): remove unnecessary if(mm) check · 841e31e5
      Rajman Mekaco 提交于
      The "if (mm)" check is not required in find_vma, as the kernel code
      calls find_vma only when it is absolutely sure that the mm_struct arg to
      it is non-NULL.
      
      Remove the if(mm) check and adding the a WARN_ONCE(!mm) for now.  This
      will serve the purpose of mandating that the execution
      context(user-mode/kernel-mode) be known before find_vma is called.  Also
      fixed 2 checkpatch.pl errors in the declaration of the rb_node and
      vma_tmp local variables.
      
      I was browsing through the internet and read a discussion at
      https://lkml.org/lkml/2012/3/27/342 which discusses removal of the
      validation check within find_vma.  Since no-one responded, I decided to
      send this patch with Andrew's suggestions.
      
      [akpm@linux-foundation.org: add remove-me comment]
      Signed-off-by: NRajman Mekaco <rajman.mekaco@gmail.com>
      Cc: Kautuk Consul <consul.kautuk@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      841e31e5
    • T
      mm: use kcalloc() instead of kzalloc() to allocate array · 4d67d860
      Thomas Meyer 提交于
      The advantage of kcalloc is, that will prevent integer overflows which
      could result from the multiplication of number of elements and size and
      it is also a bit nicer to read.
      
      The semantic patch that makes this change is available in
      https://lkml.org/lkml/2011/11/25/107Signed-off-by: NThomas Meyer <thomas@m3y3r.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d67d860
    • R
      mm: fix off-by-one bug in print_nodes_state() · f6238818
      Ryota Ozaki 提交于
      /sys/devices/system/node/{online,possible} outputs a garbage byte
      because print_nodes_state() returns content size + 1.  To fix the bug,
      the patch changes the use of cpuset_sprintf_cpulist to follow the use at
      other places, which is clearer and safer.
      
      This bug was introduced in v2.6.24 (commit bde631a5: "mm: add node
      states sysfs class attributeS").
      Signed-off-by: NRyota Ozaki <ozaki.ryota@gmail.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6238818
    • M
      mm: vmscan: remove reclaim_mode_t · 23b9da55
      Mel Gorman 提交于
      There is little motiviation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC
      and lumpy reclaim have been removed.  This patch gets rid of
      reclaim_mode_t as well and improves the documentation about what
      reclaim/compaction is and when it is triggered.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b9da55
    • M
      mm: vmscan: do not stall on writeback during memory compaction · 41ac1999
      Mel Gorman 提交于
      This patch stops reclaim/compaction entering sync reclaim as this was
      only intended for lumpy reclaim and an oversight.  Page migration has
      its own logic for stalling on writeback pages if necessary and memory
      compaction is already using it.
      
      Waiting on page writeback is bad for a number of reasons but the primary
      one is that waiting on writeback to a slow device like USB can take a
      considerable length of time.  Page reclaim instead uses
      wait_iff_congested() to throttle if too many dirty pages are being
      scanned.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      41ac1999
    • M
      mm: vmscan: remove lumpy reclaim · c53919ad
      Mel Gorman 提交于
      This series removes lumpy reclaim and some stalling logic that was
      unintentionally being used by memory compaction.  The end result is that
      stalling on dirty pages during page reclaim now depends on
      wait_iff_congested().
      
      Four kernels were compared
      
        3.3.0     vanilla
        3.4.0-rc2 vanilla
        3.4.0-rc2 lumpyremove-v2 is patch one from this series
        3.4.0-rc2 nosync-v2r3 is the full series
      
      Removing lumpy reclaim saves almost 900 bytes of text whereas the full
      series removes 1200 bytes.
      
           text     data      bss       dec     hex  filename
        67403754  1927944  2260992  10929311  a6c49f  vmlinux-3.4.0-rc2-vanilla
        6739479  1927944  2260992  10928415  a6c11f  vmlinux-3.4.0-rc2-lumpyremove-v2
        6739159  1927944  2260992  10928095  a6bfdf  vmlinux-3.4.0-rc2-nosync-v2
      
      There are behaviour changes in the series and so tests were run with
      monitoring of ftrace events.  This disrupts results so the performance
      results are distorted but the new behaviour should be clearer.
      
      fs-mark running in a threaded configuration showed little of interest as
      it did not push reclaim aggressively
      
        FS-Mark Multi Threaded
                                3.3.0-vanilla       rc2-vanilla       lumpyremove-v2r3       nosync-v2r3
        Files/s  min           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  mean          3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Files/s  stddev        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)        0.00 ( 0.00%)
        Files/s  max           3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)        3.20 ( 0.00%)
        Overhead min      508667.00 ( 0.00%)   521350.00 (-2.49%)   544292.00 (-7.00%)   547168.00 (-7.57%)
        Overhead mean     551185.00 ( 0.00%)   652690.73 (-18.42%)   991208.40 (-79.83%)   570130.53 (-3.44%)
        Overhead stddev    18200.69 ( 0.00%)   331958.29 (-1723.88%)  1579579.43 (-8578.68%)     9576.81 (47.38%)
        Overhead max      576775.00 ( 0.00%)  1846634.00 (-220.17%)  6901055.00 (-1096.49%)   585675.00 (-1.54%)
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             309.90    300.95    307.33    298.95
        User+Sys Time Running Test (seconds)        319.32    309.67    315.69    307.51
        Total Elapsed Time (seconds)               1187.85   1193.09   1191.98   1193.73
      
        MMTests Statistics: vmstat
        Page Ins                                       80532       82212       81420       79480
        Page Outs                                  111434984   111456240   111437376   111582628
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                           44881       27889       27453       34843
        Kswapd pages scanned                        25841428    25860774    25861233    25843212
        Kswapd pages reclaimed                      25841393    25860741    25861199    25843179
        Direct pages reclaimed                         44881       27889       27453       34843
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                            21754.791   21675.460   21696.029   21649.127
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                               37.783      23.375      23.031      29.188
        Percentage direct scans                           0%          0%          0%          0%
      
      ftrace showed that there was no stalling on writeback or pages submitted
      for IO from reclaim context.
      
      postmark was similar and while it was more interesting, it also did not
      push reclaim heavily.
      
        POSTMARK
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Transactions per second:               16.00 ( 0.00%)    20.00 (25.00%)    18.00 (12.50%)    17.00 ( 6.25%)
        Data megabytes read per second:        18.80 ( 0.00%)    24.27 (29.10%)    22.26 (18.40%)    20.54 ( 9.26%)
        Data megabytes written per second:     35.83 ( 0.00%)    46.25 (29.08%)    42.42 (18.39%)    39.14 ( 9.24%)
        Files created alone per second:        28.00 ( 0.00%)    38.00 (35.71%)    34.00 (21.43%)    30.00 ( 7.14%)
        Files create/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
        Files deleted alone per second:       556.00 ( 0.00%)  1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
        Files delete/transact per second:       8.00 ( 0.00%)    10.00 (25.00%)     9.00 (12.50%)     8.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             113.34    107.99    109.73    108.72
        User+Sys Time Running Test (seconds)        145.51    139.81    143.32    143.55
        Total Elapsed Time (seconds)               1159.16    899.23    980.17   1062.27
      
        MMTests Statistics: vmstat
        Page Ins                                    13710192    13729032    13727944    13760136
        Page Outs                                   43071140    42987228    42733684    42931624
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                               0           0           0           0
        Kswapd pages scanned                         9941613     9937443     9939085     9929154
        Kswapd pages reclaimed                       9940926     9936751     9938397     9928465
        Direct pages reclaimed                             0           0           0           0
        Kswapd efficiency                                99%         99%         99%         99%
        Kswapd velocity                             8576.567   11051.058   10140.164    9347.109
        Direct efficiency                               100%        100%        100%        100%
        Direct velocity                                0.000       0.000       0.000       0.000
      
      It looks like here that the full series regresses performance but as
      ftrace showed no usage of wait_iff_congested() or sync reclaim I am
      assuming it's a disruption due to monitoring.  Other data such as memory
      usage, page IO, swap IO all looked similar.
      
      Running a benchmark with a plain DD showed nothing very interesting.
      The full series stalled in wait_iff_congested() slightly less but stall
      times on vanilla kernels were marginal.
      
      Running a benchmark that hammered on file-backed mappings showed stalls
      due to congestion but not in sync writebacks
      
        MICRO
                                             3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             308.13    294.50    298.75    299.53
        User+Sys Time Running Test (seconds)        330.45    316.28    318.93    320.79
        Total Elapsed Time (seconds)               1814.90   1833.88   1821.14   1832.91
      
        MMTests Statistics: vmstat
        Page Ins                                      108712      120708       97224      110344
        Page Outs                                  155514576   156017404   155813676   156193256
        Swap Ins                                           0           0           0           0
        Swap Outs                                          0           0           0           0
        Direct pages scanned                         2599253     1550480     2512822     2414760
        Kswapd pages scanned                        69742364    71150694    68839041    69692533
        Kswapd pages reclaimed                      34824488    34773341    34796602    34799396
        Direct pages reclaimed                         53693       94750       61792       75205
        Kswapd efficiency                                49%         48%         50%         49%
        Kswapd velocity                            38427.662   38797.901   37799.972   38022.889
        Direct efficiency                                 2%          6%          2%          3%
        Direct velocity                             1432.174     845.464    1379.807    1317.446
        Percentage direct scans                           3%          2%          3%          3%
        Page writes by reclaim                             0           0           0           0
        Page writes file                                   0           0           0           0
        Page writes anon                                   0           0           0           0
        Page reclaim immediate                             0           0           0        1218
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                  15360       16384       13312       16384
        Direct inode steals                                0           0           0           0
        Kswapd inode steals                             4340        4327        1630        4323
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 0          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               900        870        754        789
        Direct time   conditional waited               0ms        0ms        0ms       20ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited              2106       2308       2116       1915
        KSwapd time   congest     waited          139924ms   157832ms   125652ms   132516ms
        KSwapd full   congest     waited              1346       1530       1202       1278
        KSwapd number conditional waited             12922      16320      10943      14670
        KSwapd time   conditional waited               0ms        0ms        0ms        0ms
        KSwapd full   conditional waited                 0          0          0          0
      
      Reclaim statistics are not radically changed.  The stall times in kswapd
      are massive but it is clear that it is due to calls to congestion_wait()
      and that is almost certainly the call in balance_pgdat().  Otherwise
      stalls due to dirty pages are non-existant.
      
      I ran a benchmark that stressed high-order allocation.  This is very
      artifical load but was used in the past to evaluate lumpy reclaim and
      compaction.  Generally I look at allocation success rates and latency
      figures.
      
        STRESS-HIGHALLOC
                         3.3.0-vanilla       rc2-vanilla  lumpyremove-v2r3       nosync-v2r3
        Pass 1          81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
        Pass 2          82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
        while Rested    88.00 ( 0.00%)    87.00 (-1.00%)    88.00 ( 0.00%)    88.00 ( 0.00%)
      
        MMTests Statistics: duration
        Sys Time Running Test (seconds)             740.93    681.42    685.14    684.87
        User+Sys Time Running Test (seconds)       2922.65   3269.52   3281.35   3279.44
        Total Elapsed Time (seconds)               1161.73   1152.49   1159.55   1161.44
      
        MMTests Statistics: vmstat
        Page Ins                                     4486020     2807256     2855944     2876244
        Page Outs                                    7261600     7973688     7975320     7986120
        Swap Ins                                       31694           0           0           0
        Swap Outs                                      98179           0           0           0
        Direct pages scanned                           53494       57731       34406      113015
        Kswapd pages scanned                         6271173     1287481     1278174     1219095
        Kswapd pages reclaimed                       2029240     1281025     1260708     1201583
        Direct pages reclaimed                          1468       14564       16649       92456
        Kswapd efficiency                                32%         99%         98%         98%
        Kswapd velocity                             5398.133    1117.130    1102.302    1049.641
        Direct efficiency                                 2%         25%         48%         81%
        Direct velocity                               46.047      50.092      29.672      97.306
        Percentage direct scans                           0%          4%          2%          8%
        Page writes by reclaim                       1616049           0           0           0
        Page writes file                             1517870           0           0           0
        Page writes anon                               98179           0           0           0
        Page reclaim immediate                        103778       27339        9796       17831
        Page rescued immediate                             0           0           0           0
        Slabs scanned                                1096704      986112      980992      998400
        Direct inode steals                              223      215040      216736      247881
        Kswapd inode steals                           175331       61548       68444       63066
        Kswapd skipped wait                            21991           0           1           0
        THP fault alloc                                    1         135         125         134
        THP collapse alloc                               393         311         228         236
        THP splits                                        25          13           7           8
        THP fault fallback                                 0           0           0           0
        THP collapse fail                                  3           5           7           7
        Compaction stalls                                865        1270        1422        1518
        Compaction success                               370         401         353         383
        Compaction failures                              495         869        1069        1135
        Compaction pages moved                        870155     3828868     4036106     4423626
        Compaction move failure                        26429       23865       29742       27514
      
      Success rates are completely hosed for 3.4-rc2 which is almost certainly
      due to commit fe2c2a10 ("vmscan: reclaim at order 0 when compaction
      is enabled").  I expected this would happen for kswapd and impair
      allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did
      not anticipate this much a difference: 80% less scanning, 37% less
      reclaim by kswapd
      
      In comparison, reclaim/compaction is not aggressive and gives up easily
      which is the intended behaviour.  hugetlbfs uses __GFP_REPEAT and would
      be much more aggressive about reclaim/compaction than THP allocations
      are.  The stress test above is allocating like neither THP or hugetlbfs
      but is much closer to THP.
      
      Mainline is now impaired in terms of high order allocation under heavy
      load although I do not know to what degree as I did not test with
      __GFP_REPEAT.  Keep this in mind for bugs related to hugepage pool
      resizing, THP allocation and high order atomic allocation failures from
      network devices.
      
      In terms of congestion throttling, I see the following for this test
      
        FTrace Reclaim Statistics: congestion_wait
        Direct number congest     waited                 3          0          0          0
        Direct time   congest     waited               0ms        0ms        0ms        0ms
        Direct full   congest     waited                 0          0          0          0
        Direct number conditional waited               957        512       1081       1075
        Direct time   conditional waited               0ms        0ms        0ms        0ms
        Direct full   conditional waited                 0          0          0          0
        KSwapd number congest     waited                36          4          3          5
        KSwapd time   congest     waited            3148ms      400ms      300ms      500ms
        KSwapd full   congest     waited                30          4          3          5
        KSwapd number conditional waited             88514        197        332        542
        KSwapd time   conditional waited            4980ms        0ms        0ms        0ms
        KSwapd full   conditional waited                49          0          0          0
      
      The "conditional waited" times are the most interesting as this is
      directly impacted by the number of dirty pages encountered during scan.
      As lumpy reclaim is no longer scanning contiguous ranges, it is finding
      fewer dirty pages.  This brings wait times from about 5 seconds to 0.
      kswapd itself is still calling congestion_wait() so it'll still stall but
      it's a lot less.
      
      In terms of the type of IO we were doing, I see this
      
        FTrace Reclaim Statistics: mm_vmscan_writepage
        Direct writes anon  sync                         0          0          0          0
        Direct writes anon  async                        0          0          0          0
        Direct writes file  sync                         0          0          0          0
        Direct writes file  async                        0          0          0          0
        Direct writes mixed sync                         0          0          0          0
        Direct writes mixed async                        0          0          0          0
        KSwapd writes anon  sync                         0          0          0          0
        KSwapd writes anon  async                    91682          0          0          0
        KSwapd writes file  sync                         0          0          0          0
        KSwapd writes file  async                   822629          0          0          0
        KSwapd writes mixed sync                         0          0          0          0
        KSwapd writes mixed async                        0          0          0          0
      
      In 3.2, kswapd was doing a bunch of async writes of pages but
      reclaim/compaction was never reaching a point where it was doing sync
      IO.  This does not guarantee that reclaim/compaction was not calling
      wait_on_page_writeback() but I would consider it unlikely.  It indicates
      that merging patches 2 and 3 to stop reclaim/compaction calling
      wait_on_page_writeback() should be safe.
      
      This patch:
      
      Lumpy reclaim had a purpose but in the mind of some, it was to kick the
      system so hard it trashed.  For others the purpose was to complicate
      vmscan.c.  Over time it was giving softer shoes and a nicer attitude but
      memory compaction needs to step up and replace it so this patch sends
      lumpy reclaim to the farm.
      
      The tracepoint format changes for isolating LRU pages with this patch
      applied.  Furthermore reclaim/compaction can no longer queue dirty pages
      in pageout() if the underlying BDI is congested.  Lumpy reclaim used
      this logic and reclaim/compaction was using it in error.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c53919ad
    • R
      mm: remove swap token code · e709ffd6
      Rik van Riel 提交于
      The swap token code no longer fits in with the current VM model.  It
      does not play well with cgroups or the better NUMA placement code in
      development, since we have only one swap token globally.
      
      It also has the potential to mess with scalability of the system, by
      increasing the number of non-reclaimable pages on the active and
      inactive anon LRU lists.
      
      Last but not least, the swap token code has been broken for a year
      without complaints, as reported by Konstantin Khlebnikov.  This suggests
      we no longer have much use for it.
      
      The days of sub-1G memory systems with heavy use of swap are over.  If
      we ever need thrashing reducing code in the future, we will have to
      implement something that does scale.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: NBob Picco <bpicco@meloft.net>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e709ffd6
    • D
      mm, thp: allow fallback when pte_alloc_one() fails for huge pmd · edad9d2c
      David Rientjes 提交于
      The transparent hugepages feature is careful to not invoke the oom
      killer when a hugepage cannot be allocated.
      
      pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however,
      currently results in VM_FAULT_OOM which invokes the pagefault oom killer
      to kill a memory-hogging task.
      
      This is unnecessary since it's possible to drop the reference to the
      hugepage and fallback to allocating a small page.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edad9d2c