1. 24 Feb, 2013 22 commits
    • M
      mm: introduce mm_populate() for populating new vmas · bebeb3d6
      Michel Lespinasse committed
      When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or
      with MCL_FUTURE in effect), we want to populate the pages within the
      newly created vmas.  This may take a while as we may have to read pages
      from disk, so ideally we want to do this outside of the write-locked
      mmap_sem region.
      
      This change introduces mm_populate(), which is used to defer populating
      such mappings until after the mmap_sem write lock has been released.
      This is implemented as a generalization of the former do_mlock_pages(),
      which accomplished the same task but was used during mlock() /
      mlockall().
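
From userspace, the mappings affected by this change are the ones created with MAP_POPULATE or MAP_LOCKED. A minimal sketch of such a call follows; the helper name map_populated() is hypothetical and the example assumes a Linux system where MAP_POPULATE is available.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and MAP_POPULATE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory and ask the kernel to pre-fault it.
 * Returns 0 on success, -1 if mmap() or munmap() fails.
 * (Hypothetical helper, for illustration only.)
 */
int map_populated(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                            -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* With MAP_POPULATE the pages are expected to be present already. */
    p[0] = 1;
    p[len - 1] = 1;

    return munmap(p, len);
}
```

The population work requested here is what mm_populate() performs after the mmap_sem write lock has been dropped.
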
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: remap_file_pages() fixes · 940e7da5
      Michel Lespinasse committed
      We have many vma manipulation functions that are fast in the typical
      case, but can optionally be instructed to populate an unbounded number
      of ptes within the region they work on:
      
       - mmap with MAP_POPULATE or MAP_LOCKED flags;
       - remap_file_pages() with MAP_NONBLOCK not set or when working on a
         VM_LOCKED vma;
       - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
         effect;
       - brk() when mlock(MCL_FUTURE) is in effect.
      
      Current code handles these pte operations locally, while the
      surrounding code has to hold the mmap_sem write side since it's
      manipulating vmas.  This means we're doing an unbounded amount of pte
      population work with mmap_sem held, and this causes problems as Andy
      Lutomirski reported (we've hit this at Google as well, though it's not
      entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
      first place).
      
      I propose introducing a new mm_populate() function to do this pte
      population work after the mmap_sem has been released.  mm_populate()
      does need to acquire the mmap_sem read side, but critically, it doesn't
      need to hold it continuously for the entire duration of the operation -
      it can drop it whenever things take too long (such as when hitting disk
      for a file read) and re-acquire it later on.
      
      The following patches are included:
      
      - Patch 1 fixes some issues I noticed while working on the existing code.
        If needed, it could potentially go in before the rest of the patches.
      
      - Patch 2 introduces the new mm_populate() function and changes
        mmap_region() call sites to use it after they drop mmap_sem. This is
        inspired by Andy Lutomirski's proposal and is built as an extension
        of the work I had previously done for mlock() and mlockall() around
        v2.6.38-rc1. I had tried doing something similar at the time but had
        given up as there were so many do_mmap() call sites; the recent cleanups
        by Linus and Viro are a tremendous help here.
      
      - Patches 3-5 convert some of the less-obvious places doing unbounded
        pte populates to the new mm_populate() mechanism.
      
      - Patches 6-7 are code cleanups that are made possible by the
        mm_populate() work. In particular, they remove more code than the
        entire patch series added, which should be a good thing :)
      
      - Patch 8 is optional to this entire series. It only helps to deal more
        nicely with racy userspace programs that might modify their mappings
        while we're trying to populate them. It adds a new VM_POPULATE flag
        on the mappings we do want to populate, so that if userspace replaces
        them with mappings it doesn't want populated, mm_populate() won't
        populate those replacement mappings.
      
      This patch:
      
      Assorted small fixes. The first two are quite small:
      
      - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
        within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
        Purely cosmetic.
      
      - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
        range, make sure we own the mmap_sem write lock around the
        munlock_vma_pages_range call as this manipulates the vma's vm_flags.
      
      The last fix requires a longer explanation. remap_file_pages() can do its
      work either through VM_NONLINEAR manipulation or by creating extra vmas.
      These two cases were inconsistent with each other (and ultimately, both
      wrong) as to exactly when they faulted in the newly mapped file pages:
      
      - In the VM_NONLINEAR case, new file pages would be populated if
        the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
        new file pages wouldn't be populated even if the vma is already
        marked as VM_LOCKED.
      
      - In the linear (emulated) case, the work is passed to the mmap_region()
        function which would populate the pages if the vma is marked as
        VM_LOCKED, and would not otherwise - regardless of the value of the
        MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
        mmap_region().
      
      The desired behavior is that we want the pages to be populated and locked
      if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
      flag is not passed to remap_file_pages().
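
The desired behavior can be written down as a small decision function. This is only a sketch of the rule stated above, not the kernel code; struct populate_policy and remap_policy() are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical result type: what should happen to the newly mapped pages. */
struct populate_policy {
    bool populate; /* fault the pages in now */
    bool lock;     /* also keep them locked in memory */
};

/*
 * vma_locked: the vma is marked VM_LOCKED.
 * nonblock:   MAP_NONBLOCK was passed to remap_file_pages().
 */
struct populate_policy remap_policy(bool vma_locked, bool nonblock)
{
    struct populate_policy p;

    p.lock = vma_locked;                  /* locked vmas stay locked */
    p.populate = vma_locked || !nonblock; /* locked implies populated */
    return p;
}
```

Both the VM_NONLINEAR path and the emulated linear path should agree with this table, which is what the two cases described above failed to do.
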
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Greg Ungerer <gregungerer@westnet.com.au>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Z
      mm: avoid calling pgdat_balanced() needlessly · dafcb73e
      Zlatko Calusic committed
      Now that balance_pgdat() is slightly tidied up, thanks to more capable
      pgdat_balanced(), it's become obvious that pgdat_balanced() is called to
      check the status, then break the loop if pgdat is balanced, just to be
      immediately called again.  The second call is completely unnecessary, of
      course.
      
      The patch introduces pgdat_is_balanced boolean, which helps resolve the
      above suboptimal behavior, with the added benefit of slightly better
      documenting one other place in the function where we jump and skip lots
      of code.
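
The pattern can be illustrated with a toy loop that remembers the result of the check instead of repeating it after the loop. Everything here (node_balanced(), balance_node(), the call counter) is a hypothetical stand-in, not the balance_pgdat() code.

```c
#include <stdbool.h>

static int balanced_calls; /* counts how often the check runs */

/* Stand-in for an expensive balance check. */
static bool node_balanced(int free_pages, int target)
{
    balanced_calls++;
    return free_pages >= target;
}

/*
 * Free `step` pages per round until the node is balanced.
 * Returns the number of rounds used, or -1 if still unbalanced.
 */
int balance_node(int free_pages, int target, int step, int max_rounds)
{
    bool is_balanced = false;
    int rounds = 0;

    while (rounds < max_rounds) {
        is_balanced = node_balanced(free_pages, target);
        if (is_balanced)
            break;
        free_pages += step;
        rounds++;
    }

    /* Reuse the cached result; no second call to node_balanced(). */
    return is_balanced ? rounds : -1;
}
```
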
      Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm: compaction: make __compact_pgdat() and compact_pgdat() return void · 7103f16d
      Andrew Morton committed
      These functions always return 0.  Formalise this.
      
      Cc: Jason Liu <r64343@freescale.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm: make madvise(MADV_WILLNEED) support swap file prefetch · 1998cc04
      Shaohua Li committed
      Make madvise(MADV_WILLNEED) support swap file prefetch.  If the memory
      has been swapped out, this syscall can prefetch it back in from swap.
      It has no effect if the memory isn't swapped out.
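
A userspace call that would exercise this path might look like the sketch below. The helper hint_willneed() is hypothetical; the example assumes Linux, and whether anything is actually prefetched depends on the pages having been swapped out beforehand.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and madvise() */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `npages` anonymous pages, dirty them, then hint that they will be
 * needed soon. Returns madvise()'s result, or -1 if the mapping fails.
 */
int hint_willneed(size_t npages)
{
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ret;

    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xab, len); /* make the pages anonymous and dirty */

    /* If any of these pages were in swap, this asks for a swap-in prefetch. */
    ret = madvise(p, len, MADV_WILLNEED);

    munmap(p, len);
    return ret;
}
```
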
      
      [akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
      [sasha.levin@oracle.com: fix BUG on madvise early failure]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      memcg,vmscan: do not break out targeted reclaim without reclaimed pages · a394cb8e
      Michal Hocko committed
      Targeted (hard or soft limit) reclaim has traditionally tried to scan one
      group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
      pages) is reclaimed or all priorities are exhausted.  The reclaim is
      then retried until the limit is met.
      
      This approach, however, doesn't work well with deeper hierarchies where
      groups higher in the hierarchy do not have any or only very few pages
      (this usually happens if those groups do not have any tasks and they
      have only re-parented pages after some of their children are removed).
      Those groups are reclaimed with decreasing priority pointlessly as there
      is nothing to reclaim from them.
      
      The easiest fix is to break out of the memcg iteration loop in
      shrink_zone only if the whole hierarchy has been visited or sufficient
      pages have been reclaimed.  This is also more natural because the
      reclaimer expects that the hierarchy under the given root is reclaimed.
      As a result we can simplify the soft limit reclaim which does its own
      iteration.
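
The break-out rule can be sketched with a toy walk over an array of groups. The function reclaim_hierarchy() and its array model are hypothetical simplifications; the real loop lives in shrink_zone and iterates memcgs.

```c
#include <stddef.h>

/*
 * pages[i] is how many pages group i can give up. Groups are visited in
 * order, and the walk stops early only once nr_to_reclaim is reached.
 * Empty groups near the top no longer end the walk.
 * Returns the total reclaimed; *visited gets the number of groups seen.
 */
unsigned long reclaim_hierarchy(const unsigned long *pages, size_t ngroups,
                                unsigned long nr_to_reclaim, size_t *visited)
{
    unsigned long reclaimed = 0;
    size_t i;

    for (i = 0; i < ngroups; i++) {
        reclaimed += pages[i];
        if (reclaimed >= nr_to_reclaim) {
            i++; /* count the group that satisfied the target */
            break;
        }
    }

    *visited = i;
    return reclaimed;
}
```

With groups holding {0, 0, 40, 10} pages and a target of 32, the walk passes the two empty parents and stops at the third group.
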
      
      [yinghan@google.com: break out of the hierarchy loop only if nr_reclaimed exceeded nr_to_reclaim]
      [akpm@linux-foundation.org: use conventional comparison order]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/ksm.c: use new hashtable implementation · 4ca3a69b
      Sasha Levin committed
      Switch ksm to use the new hashtable implementation.  This reduces the
      amount of generic unrelated code in the ksm module.
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mm/huge_memory.c: use new hashtable implementation · 43b5fbbd
      Sasha Levin committed
      Switch hugemem to use the new hashtable implementation.  This reduces
      the amount of generic unrelated code in hugemem.
      
      This also removes the dynamic allocation of the hash table.  The upside
      is that we save a pointer dereference when accessing the hashtable, but
      we lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
      doesn't support hugepages.
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: compaction: do not accidentally skip pageblocks in the migrate scanner · a9aacbcc
      Mel Gorman committed
      Compaction uses the ALIGN macro incorrectly with the migrate scanner by
      adding pageblock_nr_pages to a PFN.  It happened to work when initially
      implemented as the starting PFN was also aligned but with caching
      restarts and isolating in smaller chunks this is no longer always true.
      
      The impact is that the migrate scanner scans outside its current
      pageblock.  As pfn_valid() is still checked properly it does not cause
      any failure and the impact of the bug is that in some cases it will scan
      more than necessary when it crosses a page boundary but by no more than
      COMPACT_CLUSTER_MAX.  It is highly unlikely this is even measurable but
      it's still wrong so this patch addresses the problem.
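
The arithmetic behind the bug can be shown with a standalone round-up macro. ALIGN_UP and the two helper names are hypothetical, and the corrected form shown (aligning pfn + 1) is one way to stay inside the current pageblock, assumed here for illustration rather than quoted from the patch.

```c
/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Scan end when a whole pageblock is added before aligning. */
unsigned long end_pfn_overshoot(unsigned long pfn,
                                unsigned long pageblock_nr_pages)
{
    return ALIGN_UP(pfn + pageblock_nr_pages, pageblock_nr_pages);
}

/* End of the pageblock that actually contains pfn. */
unsigned long end_pfn_in_block(unsigned long pfn,
                               unsigned long pageblock_nr_pages)
{
    return ALIGN_UP(pfn + 1, pageblock_nr_pages);
}
```

For an aligned start such as pfn 512 with 512-page pageblocks both forms give 1024, which is why the original code appeared to work. For an unaligned start such as pfn 600, the first form gives 1536 and runs into the next pageblock, while the second gives 1024.
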
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/vmscan.c:__zone_reclaim(): replace max_t() with max() · 62b726c1
      Andrew Morton committed
      "mm: vmscan: save work scanning (almost) empty LRU lists" made
      SWAP_CLUSTER_MAX an unsigned long.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long · 90ae8d67
      Andrew Morton committed
      `int' is an inappropriate type for a number-of-pages counter.
      
      While we're there, use the clamp() macro.
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: reduce rmap overhead for ex-KSM page copies created on swap faults · af34770e
      Johannes Weiner committed
      When ex-KSM pages are faulted from swap cache, the fault handler is not
      capable of re-establishing anon_vma-spanning KSM pages.  In this case, a
      copy of the page is created instead, just like during a COW break.
      
      These freshly made copies are known to be exclusive to the faulting VMA
      and there is no reason to go look for this page in parent and sibling
      processes during rmap operations.
      
      Use page_add_new_anon_rmap() for these copies.  This also puts them on
      the proper LRU lists and marks them SwapBacked, so we can get rid of
      doing this ad-hoc in the KSM copy code.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: compaction works against zones, not lruvecs · 9b4f98cd
      Johannes Weiner committed
      The restart logic for when reclaim operates back to back with compaction
      is currently applied on the lruvec level.  But this does not make sense,
      because the container of interest for compaction is a zone as a whole,
      not the zone pages that are part of a certain memory cgroup.
      
      Negative impact is bounded.  For one, the code checks that the lruvec
      has enough reclaim candidates, so it does not risk getting stuck on a
      condition that can not be fulfilled.  And the unfairness of hammering on
      one particular memory cgroup to make progress in a zone will be
      amortized by the round robin manner in which reclaim goes through the
      memory cgroups.  Still, this can lead to unnecessary allocation
      latencies when the code elects to restart on a hard to reclaim or small
      group when there are other, more reclaimable groups in the zone.
      
      Move this logic to the zone level and restart reclaim for all memory
      cgroups in a zone when compaction requires more free pages from it.
      
      [akpm@linux-foundation.org: no need for min_t]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: clean up get_scan_count() · 9a265114
      Johannes Weiner committed
      Reclaim pressure balance between anon and file pages is calculated
      through a tuple of numerators and a shared denominator.
      
      Exceptional cases that want to force-scan anon or file pages configure
      the numerators and denominator such that one list is preferred, which is
      not necessarily the most obvious way:
      
          fraction[0] = 1;
          fraction[1] = 0;
          denominator = 1;
          goto out;
      
      Make this easier by making the force-scan cases explicit and use the
      fractionals only in case they are calculated from reclaim history.
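
A rough model of the difference: the per-list scan target is size * fraction / denominator, and the force-scan cases can instead be named directly. The enum values and scan_counts() below are illustrative names chosen for this sketch, not taken from the patch.

```c
/* How the anon/file balance is decided (illustrative names). */
enum scan_balance { SCAN_FRACT, SCAN_ANON, SCAN_FILE };

/*
 * Index 0 is the anon list, index 1 the file list.
 * fraction[] and denominator are only consulted for SCAN_FRACT.
 */
void scan_counts(enum scan_balance mode, const unsigned long size[2],
                 const unsigned long fraction[2], unsigned long denominator,
                 unsigned long out[2])
{
    switch (mode) {
    case SCAN_ANON: /* force-scan anon only */
        out[0] = size[0];
        out[1] = 0;
        break;
    case SCAN_FILE: /* force-scan file only */
        out[0] = 0;
        out[1] = size[1];
        break;
    default:        /* proportional split from reclaim history */
        out[0] = size[0] * fraction[0] / denominator;
        out[1] = size[1] * fraction[1] / denominator;
        break;
    }
}
```

Passing fraction = {1, 0} with denominator = 1 in SCAN_FRACT mode yields the same counts as SCAN_ANON, which is the implicit trick the explicit modes replace.
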
      
      [akpm@linux-foundation.org: avoid using uninitialized_var()]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: improve comment on low-page cache handling · 11d16c25
      Johannes Weiner committed
      Fix comment style and elaborate on why anonymous memory is force-scanned
      when file cache runs low.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: clarify how swappiness, highest priority, memcg interact · 10316b31
      Johannes Weiner committed
      A swappiness of 0 has a slightly different meaning for global reclaim
      (may swap if file cache really low) and memory cgroup reclaim (never
      swap, ever).
      
      In addition, global reclaim at highest priority will scan all LRU lists
      equal to their size and ignore other balancing heuristics.  UNLESS
      swappiness forbids swapping, then the lists are balanced based on recent
      reclaim effectiveness.  UNLESS file cache is running low, then anonymous
      pages are force-scanned.
      
      This (total mess of a) behaviour is implicit and not obvious from the
      way the code is organized.  At least make it apparent in the code flow
      and document the conditions.  It will be easier to come up with sane
      semantics later.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: vmscan: save work scanning (almost) empty LRU lists · d778df51
      Johannes Weiner committed
      In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
      amount of pages is scanned from the LRU lists on each iteration, to make
      progress.
      
      Do not make this minimum bigger than the respective LRU list size,
      however, and save some busy work trying to isolate and reclaim pages
      that are not there.
      
      Empty LRU lists are quite common with memory cgroups in NUMA
      environments because there exists a set of LRU lists for each zone for
      each memory cgroup, while the memory of a single cgroup is expected to
      stay on just one node.  The number of expected empty LRU lists is thus
      
        memcgs * (nodes - 1) * lru types
      
      Each attempt to reclaim from an empty LRU list does expensive size
      comparisons between lists, acquires the zone's lru lock etc.  Avoid
      that.
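
Both points can be put into a few lines: cap the forced minimum at the list size, and estimate how many lists are expected to be empty. min_scan() and expected_empty_lrus() are hypothetical helpers; the value 32 for SWAP_CLUSTER_MAX matches the kernel's definition, and nodes is assumed to be at least 1.

```c
#define SWAP_CLUSTER_MAX 32UL

/* Forced minimum scan, never larger than the LRU list itself. */
unsigned long min_scan(unsigned long lru_size)
{
    return lru_size < SWAP_CLUSTER_MAX ? lru_size : SWAP_CLUSTER_MAX;
}

/* memcgs * (nodes - 1) * lru types, as in the estimate above. */
unsigned long expected_empty_lrus(unsigned long memcgs, unsigned long nodes,
                                  unsigned long lru_types)
{
    return memcgs * (nodes - 1) * lru_types;
}
```

As an example of the scale involved, 100 memcgs on a 4-node machine with 5 LRU types would give 1500 lists expected to be empty, each of which gets a scan target of 0 under this cap.
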
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm: memcg: only evict file pages when we have plenty · 7c5bd705
      Johannes Weiner committed
      Commit e9868505 ("mm, vmscan: only evict file pages when we have
      plenty") makes a point of not going for anonymous memory while there is
      still enough inactive cache around.
      
      The check was added only for global reclaim, but it is just as useful to
      reduce swapping in memory cgroup reclaim:
      
          200M-memcg-defconfig-j2
      
                                           vanilla                   patched
          Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
          User time              668.57 (  +0.00%)         668.73 (  +0.02%)
          System time            128.92 (  +0.00%)         129.53 (  +0.46%)
          Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
          Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
          Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
          Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
          THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
          THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
          THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Satoru Moriya <satoru.moriya@hds.com>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      CMA: make putback_lru_pages() call conditional · 2a6f5124
      Srinivas Pandruvada committed
      As per documentation and other places calling putback_lru_pages(),
      putback_lru_pages() is called on error only.  Make the CMA code behave
      consistently.
      
      [akpm@linux-foundation.org: remove a test-n-branch in the wrapup code]
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/hugetlb.c: convert to pr_foo() · ffb22af5
      Andrew Morton committed
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo() · d045197f
      Andrew Morton committed
      Acked-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Sha Zhengju committed
      Currently, when a memcg OOM happens, the OOM dump messages still show
      global state and provide little useful information for users.  This patch
      prints more pointed memcg page statistics for memcg OOM.
      
      Based on Michal's advice, we take the hierarchy into consideration: suppose
      we trigger an OOM on A's limit
      
              root_memcg
                  |
                  A (use_hierarchy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that the messages dumped by show_free_areas() are lengthy and
      provide very little information about the memcg that just hit OOM.
      
      (2) After change:
      
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more targeted memcg information in the "Memory
      cgroup stats for XXX" section.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58cf188e
  2. 22 Feb, 2013 3 commits
    • D
      block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong committed
      This is a band-aid that provides stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
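
The snapshot idea described above can be sketched in user-space C. This is only an illustration, not the kernel's bounce-buffer code: `snap_page`, `SKETCH_PAGE_SIZE`, and `sketch_snapshot_page()` are hypothetical names, and `malloc()` stands in for whatever allocator the block layer actually uses.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Hypothetical stand-in for a page that is about to be written out. */
struct snap_page {
    unsigned char data[SKETCH_PAGE_SIZE];
};

/*
 * Copy the page contents into a freshly allocated bounce buffer.
 * The device would be handed the copy, so later writes to the
 * original page cannot change the data that is in flight.
 * Returns NULL if the allocation fails; the caller frees the buffer.
 */
static unsigned char *sketch_snapshot_page(const struct snap_page *page)
{
    unsigned char *bounce = malloc(SKETCH_PAGE_SIZE);

    if (bounce)
        memcpy(bounce, page->data, SKETCH_PAGE_SIZE);
    return bounce;
}
```

The trade-off is the one the commit message names: the writer never blocks, at the cost of one extra page of memory and one copy per snapshotted page.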
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting in one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffecfd1a
    • D
      mm: only enforce stable page writes if the backing device requires it · 1d1d1a76
      Darrick J. Wong committed
      Create a helper function that checks whether a backing device requires
      stable page writes and, if so, performs the necessary wait.  Then, make it so
      that all points in the memory manager that handle making pages writable
      use the helper function.  This should provide stable page write support
      to most filesystems, while eliminating unnecessary waiting for devices
      that don't require the feature.
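
A minimal user-space sketch of the conditional-wait pattern described above. All names here (`sketch_bdi`, `wb_page`, `sketch_wait_for_stable_page()`) are hypothetical stand-ins, not the kernel's structures or signatures, and the "wait" is simulated by a counter rather than by real blocking.

```c
#include <stdbool.h>

/* Hypothetical backing-device descriptor carrying the capability flag. */
struct sketch_bdi {
    bool needs_stable_pages;
};

/* Hypothetical page state: is it under writeback, and how often we waited. */
struct wb_page {
    bool under_writeback;
    int waits;
};

/* Stand-in for blocking until writeback of the page has finished. */
static void sketch_wait_on_writeback(struct wb_page *page)
{
    if (page->under_writeback) {
        page->waits++;
        page->under_writeback = false;
    }
}

/*
 * Wait only if the backing device says page contents must not change
 * during writeout; otherwise return immediately.
 */
static void sketch_wait_for_stable_page(const struct sketch_bdi *bdi,
                                        struct wb_page *page)
{
    if (!bdi->needs_stable_pages)
        return;
    sketch_wait_on_writeback(page);
}
```

Callers that are about to make a page writable would call the helper unconditionally; the decision whether to block lives in one place and depends only on the device flag.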
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own stable page guarantees or they don't block at all.
      The blocking behavior is back to what it was before 3.0 if you don't
      have a disk requiring stable page writes.
      
      Here's the result of using dbench to test latency on ext2:
      
      3.8.0-rc3:
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       WriteX        109347     0.028    59.817
       ReadX         347180     0.004     3.391
       Flush          15514    29.828   287.283
      
      Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
       WriteX        105556     0.029     4.273
       ReadX         335004     0.005     4.112
       Flush          14982    30.540   298.634
      
      Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, the maximum write latency drops considerably with this
      patch enabled.  The other filesystems (ext3/ext4/xfs/btrfs) behave
      similarly, but see the cover letter for those results.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d1d1a76
    • D
      bdi: allow block devices to say that they require stable page writes · 7d311cda
      Darrick J. Wong committed
      This patchset ("stable page writes, part 2") makes some key
      modifications to the original 'stable page writes' patchset.  First, it
      provides creators (devices and filesystems) of a backing_dev_info a flag
      that declares whether or not it is necessary to ensure that page
      contents cannot change during writeout.  It is no longer assumed that
      this is true of all devices (which was never true anyway).  Second, the
      flag is used to relax the wait_on_page_writeback calls so that a wait
      only occurs if the device needs it.  Third, it fixes up the remaining
      disk-backed filesystems to use this improved conditional-wait logic to
      provide stable page writes on those filesystems.
      
      It is hoped that (for people not using checksumming devices, anyway)
      this patchset will win back the performance lost unnecessarily since the
      original stable page write patchset went into 3.0.  Sorry about not
      fixing it sooner.
      
      Complaints were registered by several people about the long write
      latencies introduced by the original stable page write patchset.
      Generally speaking, the kernel ought to allocate as little extra memory
      as possible to facilitate writeout, but for people who simply cannot
      wait, a second page stability strategy is (re)introduced: snapshotting
      page contents.  The waiting behavior is still the default strategy; to
      enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
      set.  This flag is used to bandaid^Henable stable page writeback on
      ext3[1], and is not used anywhere else.
      
      Given that there are already a few storage devices and network FSes that
      have rolled their own page stability wait/page snapshot code, it would
      be nice to move towards consolidating all of these.  It seems possible
      that iscsi and raid5 may wish to use the new stable page write support
      to enable zero-copy writeout.
      
      Thank you to Jan Kara for helping fix a couple more filesystems.
      
      Per Andrew Morton's request, here are the results of using dbench to measure
      latencies on ext2:
      
      3.8.0-rc3:
         Operation      Count    AvgLat    MaxLat
         ----------------------------------------
         WriteX        109347     0.028    59.817
         ReadX         347180     0.004     3.391
         Flush          15514    29.828   287.283
      
        Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
         WriteX        105556     0.029     4.273
         ReadX         335004     0.005     4.112
         Flush          14982    30.540   298.634
      
        Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, for ext2 the maximum write latency decreases from ~60ms
      on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
      increase, though I suspect that being able to dirty pages faster gives
      the flusher more work to do.
      
      On ext4, the average write latency decreases as well as all the maximum
      latencies:
      
      3.8.0-rc3:
         WriteX         85624     0.152    33.078
         ReadX         272090     0.010    61.210
         Flush          12129    36.219   168.260
      
        Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms
      
      3.8.0-rc3 + patches:
         WriteX         86082     0.141    30.928
         ReadX         273358     0.010    36.124
         Flush          12214    34.800   165.689
      
        Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms
      
      XFS seems to exhibit latency improvements similar to ext2:
      
      3.8.0-rc3:
         WriteX        125739     0.028   104.343
         ReadX         399070     0.005     4.115
         Flush          17851    25.004   131.390
      
        Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms
      
      3.8.0-rc3 + patches:
         WriteX        123529     0.028     6.299
         ReadX         392434     0.005     4.287
         Flush          17549    25.120   188.687
      
        Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms
      
      ...and btrfs, just to round things out, also shows some latency
      decreases:
      
      3.8.0-rc3:
         WriteX         67122     0.083    82.355
         ReadX         212719     0.005     2.828
         Flush           9547    47.561   147.418
      
        Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms
      
      3.8.0-rc3 + patches:
         WriteX         64898     0.101    71.631
         ReadX         206673     0.005     7.123
         Flush           9190    47.963   219.034
      
        Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own wait code, or they don't block at all.  The blocking
      behavior is back to what it was before 3.0 if you don't have a disk
      requiring stable page writes.
      
      This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
      xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
      results as -rc3.
      
      [1] The alternative fixes to ext3 include fixing the locking order and
      page bit handling like we did for ext4 (but then why not just use
      ext4?), or setting PG_writeback so early that ext3 becomes extremely
      slow.  I tried that, but the number of write()s I could initiate dropped
      by nearly an order of magnitude.  That was a bit much even for the
      author of the stable page series! :)
      
      This patch:
      
      Creates a per-backing-device flag that tracks whether or not pages must
      be held immutable during writeout.  Eventually it will be used to waive
      wait_on_page_writeback() if nothing requires stable pages.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d311cda
  3. 19 Feb, 2013 1 commit
    • L
      mm: fix pageblock bitmap allocation · 7c45512d
      Linus Torvalds committed
      Commit c060f943 ("mm: use aligned zone start for pfn_to_bitidx
      calculation") fixed our calculation of the index into the pageblock
      bitmap when a !SPARSEMEM zone was not aligned to pageblock_nr_pages.
      
      However, the _allocation_ of that bitmap had never taken this alignment
      requirement into account, so depending on the exact size and alignment of
      the zone, the use of that index could then access past the allocation,
      resulting in some very subtle memory corruption.
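
The mismatch between index calculation and allocation size can be worked through with small numbers. The sketch below is a user-space illustration only: the constants (512 pages per pageblock, 4 bits per pageblock) and all function names are assumptions chosen for the example, not values or code taken from the patch.

```c
/* Assumed example values; real kernels derive these from the config. */
#define SK_PAGEBLOCK_ORDER    9UL
#define SK_PAGEBLOCK_NR_PAGES (1UL << SK_PAGEBLOCK_ORDER)
#define SK_BITS_PER_PAGEBLOCK 4UL

static unsigned long sk_roundup(unsigned long x, unsigned long multiple)
{
    return ((x + multiple - 1) / multiple) * multiple;
}

/* Bits allocated when only the zone size is considered. */
static unsigned long sk_bitmap_bits_by_size(unsigned long zone_pages)
{
    return (sk_roundup(zone_pages, SK_PAGEBLOCK_NR_PAGES)
            >> SK_PAGEBLOCK_ORDER) * SK_BITS_PER_PAGEBLOCK;
}

/* Bits allocated when the unaligned start offset is included as well. */
static unsigned long sk_bitmap_bits_aligned(unsigned long start_pfn,
                                            unsigned long zone_pages)
{
    zone_pages += start_pfn & (SK_PAGEBLOCK_NR_PAGES - 1);
    return sk_bitmap_bits_by_size(zone_pages);
}

/* First bit used for a pfn when indexing from the aligned zone start. */
static unsigned long sk_pfn_to_first_bit(unsigned long start_pfn,
                                         unsigned long pfn)
{
    unsigned long aligned_start = start_pfn & ~(SK_PAGEBLOCK_NR_PAGES - 1);

    return ((pfn - aligned_start) >> SK_PAGEBLOCK_ORDER)
           * SK_BITS_PER_PAGEBLOCK;
}
```

Under these assumptions, a zone of 512 pages starting at pfn 256 spans pfns 256 to 767 and touches two pageblocks. Sizing by zone size alone yields 4 bits, yet the last pfn indexes bits 4 to 7, which lie past the allocation; including the start offset yields 8 bits and the access stays in bounds.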
      
      This was reported (and bisected) by Ingo Molnar: one of his random
      config builds would hang with certain very specific kernel command line
      options.
      
      In the meantime, commit c060f943 has been marked for stable, so this
      fix needs to be back-ported to the stable kernels that backported the
      commit to use the right alignment.
      Bisected-and-tested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c45512d
  4. 14 Feb, 2013 1 commit
    • M
      s390/mm: implement software dirty bits · abf09bed
      Martin Schwidefsky committed
      The s390 architecture is unique with respect to dirty page detection:
      it uses the change bit in the per-page storage key to track page
      modifications. All other architectures track dirty bits by means
      of page table entries. This property of s390 has caused numerous
      problems in the past, e.g. see git commit ef5d437f
      "mm: fix XFS oops due to dirty pages without buffers on s390".
      
      To avoid future issues with per-page dirty bits, convert s390 to a
      fault-based software dirty bit detection mechanism. All
      user page table entries which are marked as clean will be hardware
      read-only, even if the pte is supposed to be writable. A write by
      the user process will trigger a protection fault, which causes
      the user pte to be marked as dirty and the hardware read-only bit
      to be removed.
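
The fault-based scheme can be modelled with a toy pte in user-space C. This is a sketch under assumptions: the `soft_pte` layout and both function names are hypothetical and do not reflect the s390 pte encoding or the kernel's fault handler.

```c
#include <stdbool.h>

/* Toy pte: logical permissions plus the bit the hardware actually checks. */
struct soft_pte {
    bool writable;     /* the mapping permits writes */
    bool dirty;        /* software dirty bit */
    bool hw_readonly;  /* hardware write protection */
};

/* A clean pte is always hardware read-only, even when logically writable. */
static struct soft_pte soft_make_clean_pte(bool writable)
{
    struct soft_pte pte = {
        .writable = writable,
        .dirty = false,
        .hw_readonly = true,
    };
    return pte;
}

/*
 * Handle a write protection fault. For a logically writable pte the
 * fault is how the write gets noticed: mark the pte dirty and lift the
 * hardware protection. Returns false if the write is not permitted.
 */
static bool soft_handle_write_fault(struct soft_pte *pte)
{
    if (!pte->writable)
        return false;
    pte->dirty = true;
    pte->hw_readonly = false;
    return true;
}
```

After the first fault the pte is dirty and no longer write-protected, so further writes to the same page proceed without faulting until the pte is cleaned again.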
      
      With this change the dirty bit in the storage key is irrelevant
      for Linux as a host, but the storage key is still required for
      KVM guests. The effect is that page_test_and_clear_dirty and the
      related code can be removed. The referenced bit in the storage
      key is still used by the page_test_and_clear_young primitive to
      provide page age information.
      
      For page cache pages of mappings with mapping_cap_account_dirty
      there will not be any change in behavior as the dirty bit tracking
      already uses read-only ptes to control the amount of dirty pages.
      Only for swap cache pages and pages of mappings without
      mapping_cap_account_dirty there can be additional protection faults.
      To avoid an excessive number of additional faults the mk_pte
      primitive checks for PageDirty if the pgprot value allows for writes
      and pre-dirties the pte. That avoids all additional faults for
      tmpfs and shmem pages until these pages are added to the swap cache.
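
The pre-dirtying step can be sketched the same way. Again this is a toy model with hypothetical names (`pd_page`, `pd_pte`, `pd_mk_pte()`), not the real mk_pte primitive or its signature.

```c
#include <stdbool.h>

/* Toy page: only the dirty flag that PageDirty would report. */
struct pd_page {
    bool page_dirty;
};

/* Toy pte with a software dirty bit and hardware write protection. */
struct pd_pte {
    bool writable;
    bool dirty;
    bool hw_readonly;
};

/*
 * Build a pte for a page. If the protection allows writes and the page
 * is already dirty, create the pte dirty and without hardware write
 * protection, so the first write does not take an extra fault.
 */
static struct pd_pte pd_mk_pte(const struct pd_page *page, bool prot_write)
{
    struct pd_pte pte;

    pte.writable = prot_write;
    pte.dirty = prot_write && page->page_dirty;
    pte.hw_readonly = !pte.dirty;
    return pte;
}
```

Only the combination "writable and already dirty" skips the fault; clean pages and read-only mappings still start out hardware read-only.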
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      abf09bed
  5. 13 Feb, 2013 3 commits
  6. 08 Feb, 2013 2 commits
  7. 05 Feb, 2013 3 commits
  8. 30 Jan, 2013 2 commits
  9. 24 Jan, 2013 1 commit
  10. 18 Jan, 2013 1 commit
  11. 12 Jan, 2013 1 commit
    • M
      mm: compaction: partially revert capture of suitable high-order page · 8fb74b9f
      Mel Gorman committed
      Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when
      waiting for POLLIN on a local TCP socket.  It was easier to trigger if
      there was disk IO and dirty pages at the same time and he bisected it to
      commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available").
      
      The intention of that patch was to improve high-order allocations under
      memory pressure after changes made to reclaim in 3.6 drastically hurt
      THP allocations but the approach was flawed.  For Eric, the problem was
      that page->pfmemalloc was not being cleared for captured pages leading
      to a poor interaction with swap-over-NFS support causing the packets to
      be dropped.  However, I identified a few more problems with the patch
      including the fact that it can increase contention on zone->lock in some
      cases which could result in async direct compaction being aborted early.
      
      In retrospect the capture patch took the wrong approach.  What it should
      have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it
      was allocating for THP and avoided races that way.  While the patch was
      shown to improve allocation success rates at the time, the benefit is
      marginal given the relative complexity and it should be revisited from
      scratch in the context of the other reclaim-related changes that have
      taken place since the patch was first written and tested.  This patch
      partially reverts commit 1fb3f8ca ("mm: compaction: capture a
      suitable high-order page immediately when it is made available").
      Reported-and-tested-by: Eric Wong <normalperson@yhbt.net>
      Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fb74b9f