1. 07 11月, 2015 17 次提交
    • K
      mm: pack compound_dtor and compound_order into one word in struct page · f1e61557
      Kirill A. Shutemov 提交于
      The patch halves space occupied by compound_dtor and compound_order in
      struct page.
      
      For compound_order, it's trivial long -> short conversion.
      
      For get_compound_page_dtor(), we now use hardcoded table for destructor
      lookup and store its index in the struct page instead of direct pointer
      to destructor. It shouldn't be a big trouble to maintain the table: we
      have only two destructor and NULL currently.
      
      This patch free up one word in tail pages for reuse. This is preparation
      for the next patch.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1e61557
    • K
      mm: drop page->slab_page · 474e4eea
      Kirill A. Shutemov 提交于
      Since 8456a648 ("slab: use struct page for slab management") nobody
      uses slab_page field in struct page.
      
      Let's drop it.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      474e4eea
    • S
      mm: zsmalloc: constify struct zs_pool name · 6f3526d6
      Sergey SENOZHATSKY 提交于
      Constify `struct zs_pool' ->name.
      
      [akpm@inux-foundation.org: constify zpool_create_pool()'s `type' arg also]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f3526d6
    • D
      zpool: remove redundant zpool->type string, const-ify zpool_get_type · 69e18f4d
      Dan Streetman 提交于
      Make the return type of zpool_get_type const; the string belongs to the
      zpool driver and should not be modified.  Remove the redundant type field
      in the struct zpool; it is private to zpool.c and isn't needed since
      ->driver->type can be used directly.  Add comments indicating strings must
      be null-terminated.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69e18f4d
    • D
      module: export param_free_charp() · 3d9c637f
      Dan Streetman 提交于
      Change the param_free_charp() function from static to exported.
      
      It is used by zswap in the next patch ("zswap: use charp for zswap param
      strings").
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d9c637f
    • M
      mm, fs: introduce mapping_gfp_constraint() · c62d2555
      Michal Hocko 提交于
      There are many places which use mapping_gfp_mask to restrict a more
      generic gfp mask which would be used for allocations which are not
      directly related to the page cache but they are performed in the same
      context.
      
      Let's introduce a helper function which makes the restriction explicit and
      easier to track.  This patch doesn't introduce any functional changes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c62d2555
    • A
      include/linux/mmzone.h: reflow comment · 89903327
      Andrew Morton 提交于
      Someone has an 86 column display.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89903327
    • M
      mm: page_alloc: hide some GFP internals and document the bits and flag combinations · dd56b046
      Mel Gorman 提交于
      Andrew stated the following
      
      	We have quite a history of remote parts of the kernel using
      	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
      	to think that this is because we didn't adequately explain the
      	interface.
      
      	And I don't think that gfp.h really improved much in this area as
      	a result of this patchset.  Could you go through it some time and
      	decide if we've adequately documented all this stuff?
      
      This patches first moves some GFP flag combinations that are part of the MM
      internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
      bits under various headings and then documents the flag combinations. It
      will not help callers that are brain damaged but the clarity might motivate
      some fixes and avoid future mistakes.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd56b046
    • M
      mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand · 0aaa29a5
      Mel Gorman 提交于
      High-order watermark checking exists for two reasons -- kswapd high-order
      awareness and protection for high-order atomic requests.  Historically the
      kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
      high-order free pages for as long as possible.  This patch introduces
      MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
      allocations on demand and avoids using those blocks for order-0
      allocations.  This is more flexible and reliable than MIGRATE_RESERVE was.
      
      A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
      allocation request steals a pageblock but limits the total number to 1% of
      the zone.  Callers that speculatively abuse atomic allocations for
      long-lived high-order allocations to access the reserve will quickly fail.
       Note that SLUB is currently not such an abuser as it reclaims at least
      once.  It is possible that the pageblock stolen has few suitable
      high-order pages and will need to steal again in the near future but there
      would need to be strong justification to search all pageblocks for an
      ideal candidate.
      
      The pageblocks are unreserved if an allocation fails after a direct
      reclaim attempt.
      
      The watermark checks account for the reserved pageblocks when the
      allocation request is not a high-order atomic allocation.
      
      The reserved pageblocks can not be used for order-0 allocations.  This may
      allow temporary wastage until a failed reclaim reassigns the pageblock.
      This is deliberate as the intent of the reservation is to satisfy a
      limited number of atomic high-order short-lived requests if the system
      requires them.
      
      The stutter benchmark was used to evaluate this but while it was running
      there was a systemtap script that randomly allocated between 1 high-order
      page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC.  This
      is much larger than the potential reserve and it does not attempt to be
      realistic.  It is intended to stress random high-order allocations from an
      unknown source, show that there is a reduction in failures without
      introducing an anomaly where atomic allocations are more reliable than
      regular allocations.  The amount of memory reserved varied throughout the
      workload as reserves were created and reclaimed under memory pressure.
      The allocation failures once the workload warmed up were as follows;
      
      4.2-rc5-vanilla		70%
      4.2-rc5-atomic-reserve	56%
      
      The failure rate was also measured while building multiple kernels.  The
      failure rate was 14% but is 6% with this patch applied.
      
      Overall, this is a small reduction but the reserves are small relative to
      the number of allocation requests.  In early versions of the patch, the
      failure rate reduced by a much larger amount but that required much larger
      reserves and perversely made atomic allocations seem more reliable than
      regular allocations.
      
      [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0aaa29a5
    • M
      mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Mel Gorman 提交于
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time it was discovered that high-order atomic
      allocations relied on this property so MIGRATE_RESERVE was introduced.  A
      later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
      deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      974a786e
    • M
      mm, page_alloc: delete the zonelist_cache · f77cf4e4
      Mel Gorman 提交于
      The zonelist cache (zlc) was introduced to skip over zones that were
      recently known to be full.  This avoided expensive operations such as the
      cpuset checks, watermark calculations and zone_reclaim.  The situation
      today is different and the complexity of zlc is harder to justify.
      
      1) The cpuset checks are no-ops unless a cpuset is active and in general
         are a lot cheaper.
      
      2) zone_reclaim is now disabled by default and I suspect that was a large
         source of the cost that zlc wanted to avoid. When it is enabled, it's
         known to be a major source of stalling when nodes fill up and it's
         unwise to hit every other user with the overhead.
      
      3) Watermark checks are expensive to calculate for high-order
         allocation requests. Later patches in this series will reduce the cost
         of the watermark checking.
      
      4) The most important issue is that in the current implementation it
         is possible for a failed THP allocation to mark a zone full for order-0
         allocations and cause a fallback to remote nodes.
      
      The last issue could be addressed with additional complexity but as the
      benefit of zlc is questionable, it is better to remove it.  If stalls due
      to zone_reclaim are ever reported then an alternative would be to
      introduce deferring logic based on a timeout inside zone_reclaim itself
      and leave the page allocator fast paths alone.
      
      The impact on page-allocator microbenchmarks is negligible as they don't
      hit the paths where the zlc comes into play.  Most page-reclaim related
      workloads showed no noticeable difference as a result of the removal.
      
      The impact was noticeable in a workload called "stutter".  One part uses a
      lot of anonymous memory, a second measures mmap latency and a third copies
      a large file.  In an ideal world the latency application would not notice
      the mmap latency.  On a 2-node machine the results of this patch are
      
      stutter
                                   4.3.0-rc1             4.3.0-rc1
                                    baseline              nozlc-v4
      Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
      1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
      2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
      3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
      Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
      Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
      Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
      Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
      Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
      Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
      Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
      Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
      Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
      Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
      Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
      Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
      Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
      
      Note that the maximum stall latency went from 24 seconds to 12 which is
      still bad but an improvement.  The milage varies considerably 2-node
      machine on an earlier test went from 494 seconds to 47 seconds and a
      4-node machine that tested an earlier version of this patch went from a
      worst case stall time of 6 seconds to 67ms.  The nature of the benchmark
      is inherently unpredictable as it is hammering the system and the milage
      will vary between machines.
      
      There is a secondary impact with potentially more direct reclaim because
      zones are now being considered instead of being skipped by zlc.  In this
      particular test run it did not occur so will not be described.  However,
      in at least one test the following was observed
      
      1. Direct reclaim rates were higher. This was likely due to direct reclaim
        being entered instead of the zlc disabling a zone and busy looping.
        Busy looping may have the effect of allowing kswapd to make more
        progress and in some cases may be better overall. If this is found then
        the correct action is to put direct reclaimers to sleep on a waitqueue
        and allow kswapd make forward progress. Busy looping on the zlc is even
        worse than when the allocator used to blindly call congestion_wait().
      
      2. There was higher swap activity as direct reclaim was active.
      
      3. Direct reclaim efficiency was lower. This is related to 1 as more
        scanning activity also encountered more pages that could not be
        immediately reclaimed
      
      In that case, the direct page scan and reclaim rates are noticeable but
      it is not considered a problem for a few reasons
      
      1. The test is primarily concerned with latency. The mmap attempts are also
         faulted which means there are THP allocation requests. The ZLC could
         cause zones to be disabled causing the process to busy loop instead
         of reclaiming.  This looks like elevated direct reclaim activity but
         it's the correct action to take based on what processes requested.
      
      2. The test hammers reclaim and compaction heavily. The number of successful
         THP faults is highly variable but affects the reclaim stats. It's not a
         realistic or reasonable measure of page reclaim activity.
      
      3. No other page-reclaim intensive workload that was tested showed a problem.
      
      4. If a workload is identified that benefitted from the busy looping then it
         should be fixed by having direct reclaimers sleep on a wait queue until
         woken by kswapd instead of busy looping. We had this class of problem before
         when congestion_waits() with a fixed timeout was a brain damaged decision
         but happened to benefit some workloads.
      
      If a workload is identified that relied on the zlc to busy loop then it
      should be fixed correctly and have a direct reclaimer sleep on a waitqueue
      until woken by kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f77cf4e4
    • M
      mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Mel Gorman 提交于
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      them prevents it.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71baba4b
    • M
      mm: page_alloc: remove GFP_IOFS · 40113370
      Mel Gorman 提交于
      GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
      allocation flags.  There is only one user of this flag combination now and
      there appears to be no reason why Lustre had to be protected from reclaim
      stalls.  As none of the sites appear to be atomic, this patch simply
      deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or
      GFP_NOIO as appropriate.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40113370
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
    • M
      mm, page_alloc: use masks and shifts when converting GFP flags to migrate types · 016c13da
      Mel Gorman 提交于
      This patch redefines which GFP bits are used for specifying mobility and
      the order of the migrate types.  Once redefined it's possible to convert
      GFP flags to a migrate type with a simple mask and shift.  The only
      downside is that readers of OOM kill messages and allocation failures may
      have been used to the existing values but scripts/gfp-translate will help.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      016c13da
    • M
      mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled · 46e700ab
      Mel Gorman 提交于
      There is a seqcounter that protects against spurious allocation failures
      when a task is changing the allowed nodes in a cpuset.  There is no need
      to check the seqcounter until a cpuset exists.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46e700ab
    • M
      mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe · e2b19197
      Mel Gorman 提交于
      Overall, the intent of this series is to remove the zonelist cache which
      was introduced to avoid high overhead in the page allocator.  Once this is
      done, it is necessary to reduce the cost of watermark checks.
      
      The series starts with minor micro-optimisations.
      
      Next it notes that GFP flags that affect watermark checks are abused.
      __GFP_WAIT historically identified callers that could not sleep and could
      access reserves.  This was later abused to identify callers that simply
      prefer to avoid sleeping and have other options.  A patch distinguishes
      between atomic callers, high-priority callers and those that simply wish
      to avoid sleep.
      
      The zonelist cache has been around for a long time but it is of dubious
      merit with a lot of complexity and some issues that are explained.  The
      most important issue is that a failed THP allocation can cause a zone to
      be treated as "full".  This potentially causes unnecessary stalls, reclaim
      activity or remote fallbacks.  The issues could be fixed but it's not
      worth it.  The series places a small number of other micro-optimisations
      on top before examining GFP flags watermarks.
      
      High-order watermarks enforcement can cause high-order allocations to fail
      even though pages are free.  The watermark checks both protect high-order
      atomic allocations and make kswapd aware of high-order pages but there is
      a much better way that can be handled using migrate types.  This series
      uses page grouping by mobility to reserve pageblocks for high-order
      allocations with the size of the reservation depending on demand.  kswapd
      awareness is maintained by examining the free lists.  By patch 12 in this
      series, there are no high-order watermark checks while preserving the
      properties that motivated the introduction of the watermark checks.
      
      This patch (of 10):
      
      No user of zone_watermark_ok_safe() specifies alloc_flags.  This patch
      removes the unnecessary parameter.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2b19197
  2. 06 11月, 2015 23 次提交
    • E
      mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage · b0f205c2
      Eric B Munson 提交于
      The previous patch introduced a flag that specified pages in a VMA should
      be placed on the unevictable LRU, but they should not be made present when
      the area is created.  This patch adds the ability to set this state via
      the new mlock system calls.
      
      We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
      MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
      MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
      When used with MCL_CURRENT, all current mappings will be marked with
      VM_LOCKED | VM_LOCKONFAULT.  When used with MCL_FUTURE, the mm->def_flags
      will be marked with VM_LOCKED | VM_LOCKONFAULT.  When used with both
      MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
      marked with VM_LOCKED | VM_LOCKONFAULT.
      
      Prior to this patch, mlockall() will unconditionally clear the
      mm->def_flags any time it is called without MCL_FUTURE.  This behavior is
      maintained after adding MCL_ONFAULT.  If a call to mlockall(MCL_FUTURE) is
      followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
      new VMAs will be unlocked.  This remains true with or without MCL_ONFAULT
      in either mlockall() invocation.
      
      munlock() will unconditionally clear both vma flags.  munlockall()
      unconditionally clears for VMA flags on all VMAs and in the mm->def_flags
      field.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b0f205c2
    • E
      mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson 提交于
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statical language model (probably applies to other statical or graphical
      models as well).  For the security example, any application transacting in
      data that cannot be swapped out (credit card data, medical records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
    • E
      mm: mlock: add new mlock system call · a8ca5d0e
      Eric B Munson 提交于
      With the refactored mlock code, introduce a new system call for mlock.
      The new call will allow the user to specify what lock states are being
      added.  mlock2 is trivial at the moment, but a follow on patch will add a
      new mlock state making it useful.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8ca5d0e
    • T
      mm: remove refresh_cpu_vm_stats() definition for !SMP kernel · 5ba97bf9
      Tetsuo Handa 提交于
      refresh_cpu_vm_stats(int cpu) is no longer referenced by !SMP kernel
      since Linux 3.12.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ba97bf9
    • J
      mm: page_counter: let page_counter_try_charge() return bool · 6071ca52
      Johannes Weiner 提交于
      page_counter_try_charge() currently returns 0 on success and -ENOMEM on
      failure, which is surprising behavior given the function name.
      
      Make it follow the expected pattern of try_stuff() functions that return a
      boolean true to indicate success, or false for failure.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6071ca52
    • H
      mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Hugh Dickins 提交于
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45637bab
    • V
      mm: do not inc NR_PAGETABLE if ptlock_init failed · 706874e9
      Vladimir Davydov 提交于
      If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
      shouldn't increment NR_PAGETABLE.
      
      Since small allocations, such as ptlock, normally do not fail (currently
      they can fail if kmemcg is used though), this patch does not really fix
      anything and should be considered as a code cleanup.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      706874e9
    • V
      memcg: simplify and inline __mem_cgroup_from_kmem · df406551
      Vladimir Davydov 提交于
      Before the previous patch ("memcg: unify slab and other kmem pages
      charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
      pages and pages allocated with alloc_kmem_pages - memcg in the page
      struct.  Now we can unify it.  Since after it, this function becomes tiny
      we can fold it into mem_cgroup_from_kmem.
      
      [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df406551
    • V
      memcg: unify slab and other kmem pages charging · f3ccb2c4
      Vladimir Davydov 提交于
      We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
      uncharging kmem pages to memcg, but currently they are not used for
      charging slab pages (i.e.  they are only used for charging pages allocated
      with alloc_kmem_pages).  The only reason why the slab subsystem uses
      special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
      needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
      to the memcg that the current task belongs to.
      
      To remove this diversity, this patch adds an extra argument to
      __memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
      not NULL, the function tries to charge to the memcg it points to,
      otherwise it charge to the current context.  Next, it makes the slab
      subsystem use this function to charge slab pages.
      
      Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
      in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
      __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
      don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
      Besides, one can now detect which memcg a slab page belongs to by reading
      /proc/kpagecgroup.
      
      Note, this patch switches slab to charge-after-alloc design.  Since this
      design is already used for all other memcg charges, it should not make any
      difference.
      
      [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3ccb2c4
    • V
      memcg: simplify charging kmem pages · d05e83a6
      Vladimir Davydov 提交于
      Charging kmem pages proceeds in two steps.  First, we try to charge the
      allocation size to the memcg the current task belongs to, then we allocate
      a page and "commit" the charge storing the pointer to the memcg in the
      page struct.
      
      Such a design looks overcomplicated, because there is not much sense in
      trying charging the allocation before actually allocating a page: we won't
      be able to consume much memory over the limit even if we charge after
      doing the actual allocation, besides we already charge user pages post
      factum, so being pedantic with kmem pages just looks pointless.
      
      So this patch simplifies the design by merging the "charge" and the
      "commit" steps into the same function, which takes the allocated page.
      
      Also, rename the charge and uncharge methods to memcg_kmem_charge and
      memcg_kmem_uncharge and make the charge method return error code instead
      of bool to conform to mem_cgroup_try_charge.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d05e83a6
    • Y
      include/linux/vm_event_item.h: change HIGHMEM_ZONE macro definition · f7ae3a95
      yalin wang 提交于
      Change HIGHMEM_ZONE to be the same as the DMA_ZONE macro.
      Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ae3a95
    • A
      mm/vmstat.c: uninline node_page_state() · c2d42c16
      Andrew Morton 提交于
      With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
      (4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
      stack.  Uninlining node_page_state() reduces this to 440 bytes.
      
      The stack consumption issue is fixed by newer gcc (4.8.4) however with
      that compiler this patch reduces the node.o text size from 7314 bytes to
      4578.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2d42c16
    • V
      mm: optimize PageHighMem() check · 3ca65c19
      Vineet Gupta 提交于
      This came up when implementing HIHGMEM/PAE40 for ARC.  The kmap() /
      kmap_atomic() generated code seemed needlessly bloated due to the way
      PageHighMem() macro is implemented.  It derives the exact zone for page
      and then does pointer subtraction with first zone to infer the zone_type.
      The pointer arithmatic in turn generates the code bloat.
      
      PageHighMem(page)
        is_highmem(page_zone(page))
           zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones
      
      Instead use is_highmem_idx() to work on zone_type available in page flags
      
         ----- Before -----
      80756348:	mov_s      r13,r0
      8075634a:	ld_s       r2,[r13,0]
      8075634c:	lsr_s      r2,r2,30
      8075634e:	mpy        r2,r2,0x2a4
      80756352:	add_s      r2,r2,0x80aef880
      80756358:	ld_s       r3,[r2,28]
      8075635a:	sub_s      r2,r2,r3
      8075635c:	breq       r2,0x2a4,80756378 <kmap+0x48>
      80756364:	breq       r2,0x548,80756378 <kmap+0x48>
      
         ----- After  -----
      80756330:	mov_s      r13,r0
      80756332:	ld_s       r2,[r13,0]
      80756334:	lsr_s      r2,r2,30
      80756336:	sub_s      r2,r2,1
      80756338:	brlo       r2,2,80756348 <kmap+0x30>
      
      For x86 defconfig build (32 bit only) it saves around 900 bytes.
      For ARC defconfig with HIGHMEM, it saved around 2K bytes.
      
         ---->8-------
      ./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
      add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
      function                                     old     new   delta
      saveable_page                                162     154      -8
      saveable_highmem_page                        154     146      -8
      skb_gro_reset_offset                         147     131     -16
      ...
      ...
      __change_page_attr_set_clr                  1715    1678     -37
      setup_data_read                              434     394     -40
      mon_bin_event                               1967    1927     -40
      swsusp_save                                 1148    1105     -43
      _set_pages_array                             549     493     -56
         ---->8-------
      
      e.g. For ARC kmap()
      Signed-off-by: NVineet Gupta <vgupta@synopsys.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jennifer Herbert <jennifer.herbert@citrix.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ca65c19
    • D
      mm, oom: remove task_lock protecting comm printing · da39da3a
      David Rientjes 提交于
      The oom killer takes task_lock() in a couple of places solely to protect
      printing the task's comm.
      
      A process's comm, including current's comm, may change due to
      /proc/pid/comm or PR_SET_NAME.
      
      The comm will always be NULL-terminated, so the worst race scenario would
      only be during update.  We can tolerate a comm being printed that is in
      the middle of an update to avoid taking the lock.
      
      Other locations in the kernel have already dropped task_lock() when
      printing comm, so this is consistent.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da39da3a
    • V
      mm, compaction: distinguish contended status in tracepoints · 2d1e1041
      Vlastimil Babka 提交于
      Compaction returns prematurely with COMPACT_PARTIAL when contended or has
      fatal signal pending.  This is ok for the callers, but might be misleading
      in the traces, as the usual reason to return COMPACT_PARTIAL is that we
      think the allocation should succeed.  After this patch we distinguish the
      premature ending condition in the mm_compaction_finished and
      mm_compaction_end tracepoints.
      
      The contended status covers the following reasons:
      - lock contention or need_resched() detected in async compaction
      - fatal signal pending
      - too many pages isolated in the zone (only for async compaction)
      Further distinguishing the exact reason seems unnecessary for now.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d1e1041
    • V
      mm, compaction: export tracepoints zone names to userspace · 1743d050
      Vlastimil Babka 提交于
      Some compaction tracepoints use zone->name to print which zone is being
      compacted.  This works for in-kernel printing, but not userspace trace
      printing of raw captured trace such as via trace-cmd report.
      
      This patch uses zone_idx() instead of zone->name as the raw value, and
      when printing, converts the zone_type to string using the appropriate EM()
      macros and some ugly tricks to overcome the problem that half the values
      depend on CONFIG_ options and one does not simply use #ifdef inside of
      #define.
      
      trace-cmd output before:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
      zone=ffffffff81815d7a order=9 ret=partial
      
      after:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
      zone=Normal   order=9 ret=partial
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Valentin Rothberg <valentinrothberg@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1743d050
    • V
      mm, compaction: export tracepoints status strings to userspace · fa6c7b46
      Vlastimil Babka 提交于
      Some compaction tracepoints convert the integer return values to strings
      using the compaction_status_string array.  This works for in-kernel
      printing, but not userspace trace printing of raw captured trace such as
      via trace-cmd report.
      
      This patch converts the private array to appropriate tracepoint macros
      that result in proper userspace support.
      
      trace-cmd output before:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
        zone=ffffffff81815d7a order=9 ret=
      
      after:
      transhuge-stres-4235  [000]   453.149280: mm_compaction_finished: node=0
        zone=ffffffff81815d7a order=9 ret=partial
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa6c7b46
    • Y
      mm/memcontrol: make mem_cgroup_inactive_anon_is_low() return bool · 13308ca9
      Yaowei Bai 提交于
      Make mem_cgroup_inactive_anon_is_low return bool due to this particular
      function only using either one or zero as its return value.
      
      No functional change.
      Signed-off-by: NYaowei Bai <bywxiaobai@163.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13308ca9
    • J
      mm/filemap.c: make global sync not clear error status of individual inodes · aa750fd7
      Junichi Nomura 提交于
      filemap_fdatawait() is a function to wait for on-going writeback to
      complete but also consume and clear error status of the mapping set during
      writeback.
      
      The latter functionality is critical for applications to detect writeback
      error with system calls like fsync(2)/fdatasync(2).
      
      However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
      which don't check error status of individual mappings.
      
      As a result, fsync() may not be able to detect writeback error if events
      happen in the following order:
      
         Application                    System admin
         ----------------------------------------------------------
         write data on page cache
                                        Run sync command
                                        writeback completes with error
                                        filemap_fdatawait() clears error
         fsync returns success
         (but the data is not on disk)
      
      This patch adds filemap_fdatawait_keep_errors() for call sites where
      writeback error is not handled so that they don't clear error status.
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa750fd7
    • N
      mm: hugetlb: proc: add HugetlbPages field to /proc/PID/status · 5d317b2b
      Naoya Horiguchi 提交于
      Currently there's no easy way to get per-process usage of hugetlb pages,
      which is inconvenient because userspace applications which use hugetlb
      typically want to control their processes on the basis of how much memory
      (including hugetlb) they use.  So this patch simply provides easy access
      to the info via /proc/PID/status.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NJoern Engel <joern@logfs.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d317b2b
    • R
      mm: use only per-device readahead limit · 600e19af
      Roman Gushchin 提交于
      Maximal readahead size is limited now by two values:
       1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead())
       2) by configurable per-device value* (bdi->ra_pages)
      
      There are devices, which require custom readahead limit.
      For instance, for RAIDs it's calculated as number of devices
      multiplied by chunk size times 2.
      
      Readahead size can never be larger than bdi->ra_pages * 2 value
      (POSIX_FADV_SEQUNTIAL doubles readahead size).
      
      If so, why do we need two limits?
      I suggest to completely remove this max_sane_readahead() stuff and
      use per-device readahead limit everywhere.
      
      Also, using right readahead size for RAID disks can significantly
      increase i/o performance:
      
      before:
        dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
      
      after:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
      
      (It's an 8-disks RAID5 storage).
      
      This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
      behavior introduced by 6d2be915 ("mm/readahead.c: fix readahead
      failure for memoryless NUMA nodes and limit readahead pages").
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      600e19af
    • Y
      mm/page_alloc: remove unused parameter in init_currently_empty_zone() · b171e409
      Yaowei Bai 提交于
      Commit a2f3aa02 ("[PATCH] Fix sparsemem on Cell") fixed an oops
      experienced on the Cell architecture when init-time functions,
      early_*(), are called at runtime by introducing an 'enum memmap_context'
      parameter to memmap_init_zone() and init_currently_empty_zone().  This
      parameter is intended to be used to tell whether the call of these two
      functions is being made on behalf of a hotplug event, or happening at
      boot-time.  However, init_currently_empty_zone() does not use this
      parameter at all, so remove it.
      Signed-off-by: NYaowei Bai <bywxiaobai@163.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b171e409
    • A
      mm/memblock: make memblock_remove_range() static · 35bd16a2
      Alexander Kuleshov 提交于
      memblock_remove_range() is only used in the mm/memblock.c, so we can make
      it static.
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35bd16a2