1. 29 4月, 2016 1 次提交
  2. 18 3月, 2016 4 次提交
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim 提交于
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
    • V
      mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker · 0fc9f58a
      Vladimir Davydov 提交于
      It's just convenient to implement a memcg aware shrinker when you know
      that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
      false.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fc9f58a
    • V
      mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Vlastimil Babka 提交于
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to attempt making memory allocation of given
      order available.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make high-order page available) or
      efficiency (don't reclaim too much).  The simplification relieas of
      doing all compaction in kcompactd, which is simply woken up when high
      watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0 pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark), the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset by
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time isn't so useful for this benchmark,
      because of the background interference being unpredictable.  It's just
      to quickly spot some major unexpected differences.  System time is
      somewhat more useful and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease of overal time spent compacting appears to not match the
      increased compaction stats.  I suspect the tasks get rescheduled and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether busy compacting or waiting for CPU doesn't
      matter.  And that latency seems to almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There's however significant
      reduction in direct compaction stalls (that is, the number of
      allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accf6242
    • V
      mm, kswapd: remove bogus check of balance_classzone_idx · 81c5857b
      Vlastimil Babka 提交于
      During work on kcompactd integration I have spotted a confusing check of
      balance_classzone_idx, which I believe is bogus.
      
      The balanced_classzone_idx is filled by balance_pgdat() as the highest
      zone it attempted to balance.  This was introduced by commit dc83edd9
      ("mm: kswapd: use the classzone idx that kswapd was using for
      sleeping_prematurely()").
      
      The intention is that (as expressed in today's function names), the
      value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
      as for the decisions in kswapd_try_to_sleep().
      
      An unwanted side-effect of that commit was breaking the checks in
      kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
      classzone_idx.  Commits 215ddd66 ("mm: vmscan: only read
      new_classzone_idx from pgdat when reclaiming successfully") and
      d2ebd0f6 ("kswapd: avoid unnecessary rebalance after an unsuccessful
      balancing") tried to fixed, but apparently introduced a bogus check that
      this patch removes.
      
      Consider zone indexes X < Y < Z, where:
      - Z is the value used for the first kswapd wakeup.
      - Y is returned as balanced_classzone_idx, which means zones with index higher
        than Y (including Z) were found to be unreclaimable.
      - X is the value used for the second kswapd wakeup
      
      The new wakeup with value X means that kswapd is now supposed to balance
      harder all zones with index <= X.  But instead, due to Y < Z, it will go
      sleep and won't read the new value X.  This is subtly wrong.
      
      The effect of this patch is that kswapd will react better in some
      situations, where e.g.  the first wakeup is for ZONE_DMA32, the second is
      for ZONE_DMA, and due to unreclaimable ZONE_NORMAL.  Before this patch,
      kswapd would go sleep instead of reclaiming ZONE_DMA harder.  I expect
      these situations are very rare, and more value is in better
      maintainability due to the removal of confusing and bogus check.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81c5857b
  3. 16 3月, 2016 6 次提交
  4. 06 2月, 2016 1 次提交
  5. 23 1月, 2016 1 次提交
    • R
      dax: support dirty DAX entries in radix tree · f9fe48be
      Ross Zwisler 提交于
      Add support for tracking dirty DAX entries in the struct address_space
      radix tree.  This tree is already used for dirty page writeback, and it
      already supports the use of exceptional (non struct page*) entries.
      
      In order to properly track dirty DAX pages we will insert new
      exceptional entries into the radix tree that represent dirty DAX PTE or
      PMD pages.  These exceptional entries will also contain the writeback
      addresses for the PTE or PMD faults that we can use at fsync/msync time.
      
      There are currently two types of exceptional entries (shmem and shadow)
      that can be placed into the radix tree, and this adds a third.  We rely
      on the fact that only one type of exceptional entry can be found in a
      given radix tree based on its usage.  This happens for free with DAX vs
      shmem but we explicitly prevent shadow entries from being added to radix
      trees for DAX mappings.
      
      The only shadow entries that would be generated for DAX radix trees
      would be to track zero page mappings that were created for holes.  These
      pages would receive minimal benefit from having shadow entries, and the
      choice to have only one type of exceptional entry in a given radix tree
      makes the logic simpler both in clear_exceptional_entry() and in the
      rest of DAX.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Jeff Layton <jlayton@poochiereds.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9fe48be
  6. 21 1月, 2016 5 次提交
  7. 16 1月, 2016 2 次提交
    • M
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim 提交于
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • K
      page-flags: define PG_locked behavior on compound pages · 48c935ad
      Kirill A. Shutemov 提交于
      lock_page() must operate on the whole compound page.  It doesn't make
      much sense to lock part of compound page.  Change code to use head
      page's PG_locked, if tail page is passed.
      
      This patch also gets rid of custom helper functions --
      __set_page_locked() and __clear_page_locked().  They are replaced with
      helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
      helper would trigger VM_BUG_ON().
      
      SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
      appear there.  VM_BUG_ON() is added to make sure that this assumption is
      correct.
      
      [akpm@linux-foundation.org: fix fs/cifs/file.c]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c935ad
  8. 15 1月, 2016 7 次提交
  9. 07 11月, 2015 2 次提交
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
    • M
      mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe · e2b19197
      Mel Gorman 提交于
      Overall, the intent of this series is to remove the zonelist cache which
      was introduced to avoid high overhead in the page allocator.  Once this is
      done, it is necessary to reduce the cost of watermark checks.
      
      The series starts with minor micro-optimisations.
      
      Next it notes that GFP flags that affect watermark checks are abused.
      __GFP_WAIT historically identified callers that could not sleep and could
      access reserves.  This was later abused to identify callers that simply
      prefer to avoid sleeping and have other options.  A patch distinguishes
      between atomic callers, high-priority callers and those that simply wish
      to avoid sleep.
      
      The zonelist cache has been around for a long time but it is of dubious
      merit with a lot of complexity and some issues that are explained.  The
      most important issue is that a failed THP allocation can cause a zone to
      be treated as "full".  This potentially causes unnecessary stalls, reclaim
      activity or remote fallbacks.  The issues could be fixed but it's not
      worth it.  The series places a small number of other micro-optimisations
      on top before examining GFP flags watermarks.
      
      High-order watermarks enforcement can cause high-order allocations to fail
      even though pages are free.  The watermark checks both protect high-order
      atomic allocations and make kswapd aware of high-order pages but there is
      a much better way that can be handled using migrate types.  This series
      uses page grouping by mobility to reserve pageblocks for high-order
      allocations with the size of the reservation depending on demand.  kswapd
      awareness is maintained by examining the free lists.  By patch 12 in this
      series, there are no high-order watermark checks while preserving the
      properties that motivated the introduction of the watermark checks.
      
      This patch (of 10):
      
      No user of zone_watermark_ok_safe() specifies alloc_flags.  This patch
      removes the unnecessary parameter.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2b19197
  10. 06 11月, 2015 3 次提交
  11. 23 9月, 2015 1 次提交
  12. 09 9月, 2015 4 次提交
  13. 05 9月, 2015 2 次提交
    • M
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman 提交于
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman 提交于
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
  14. 05 8月, 2015 1 次提交
    • M
      mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations · ecf5fc6e
      Michal Hocko 提交于
      Nikolay has reported a hang when a memcg reclaim got stuck with the
      following backtrace:
      
      PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
        #0 __schedule at ffffffff815ab152
        #1 schedule at ffffffff815ab76e
        #2 schedule_timeout at ffffffff815ae5e5
        #3 io_schedule_timeout at ffffffff815aad6a
        #4 bit_wait_io at ffffffff815abfc6
        #5 __wait_on_bit at ffffffff815abda5
        #6 wait_on_page_bit at ffffffff8111fd4f
        #7 shrink_page_list at ffffffff81135445
        #8 shrink_inactive_list at ffffffff81135845
        #9 shrink_lruvec at ffffffff81135ead
       #10 shrink_zone at ffffffff811360c3
       #11 shrink_zones at ffffffff81136eff
       #12 do_try_to_free_pages at ffffffff8113712f
       #13 try_to_free_mem_cgroup_pages at ffffffff811372be
       #14 try_charge at ffffffff81189423
       #15 mem_cgroup_try_charge at ffffffff8118c6f5
       #16 __add_to_page_cache_locked at ffffffff8112137d
       #17 add_to_page_cache_lru at ffffffff81121618
       #18 pagecache_get_page at ffffffff8112170b
       #19 grow_dev_page at ffffffff811c8297
       #20 __getblk_slow at ffffffff811c91d6
       #21 __getblk_gfp at ffffffff811c92c1
       #22 ext4_ext_grow_indepth at ffffffff8124565c
       #23 ext4_ext_create_new_leaf at ffffffff81246ca8
       #24 ext4_ext_insert_extent at ffffffff81246f09
       #25 ext4_ext_map_blocks at ffffffff8124a848
       #26 ext4_map_blocks at ffffffff8121a5b7
       #27 mpage_map_one_extent at ffffffff8121b1fa
       #28 mpage_map_and_submit_extent at ffffffff8121f07b
       #29 ext4_writepages at ffffffff8121f6d5
       #30 do_writepages at ffffffff8112c490
       #31 __filemap_fdatawrite_range at ffffffff81120199
       #32 filemap_flush at ffffffff8112041c
       #33 ext4_alloc_da_blocks at ffffffff81219da1
       #34 ext4_rename at ffffffff81229b91
       #35 ext4_rename2 at ffffffff81229e32
       #36 vfs_rename at ffffffff811a08a5
       #37 SYSC_renameat2 at ffffffff811a3ffc
       #38 sys_renameat2 at ffffffff811a408e
       #39 sys_rename at ffffffff8119e51e
       #40 system_call_fastpath at ffffffff815afa89
      
      Dave Chinner has properly pointed out that this is a deadlock in the
      reclaim code because ext4 doesn't submit pages which are marked by
      PG_writeback right away.
      
      The heuristic was introduced by commit e62e384e ("memcg: prevent OOM
      with too many dirty pages") and it was applied only when may_enter_fs
      was specified.  The code has been changed by c3b94f44 ("memcg:
      further prevent OOM with too many dirty pages") which has removed the
      __GFP_FS restriction with a reasoning that we do not get into the fs
      code.  But this is not sufficient apparently because the fs doesn't
      necessarily submit pages marked PG_writeback for IO right away.
      
      ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
      submit the bio.  Instead it tries to map more pages into the bio and
      mpage_map_one_extent might trigger memcg charge which might end up
      waiting on a page which is marked PG_writeback but hasn't been submitted
      yet so we would end up waiting for something that never finishes.
      
      Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
      before we go to wait on the writeback.  The page fault path, which is
      the only path that triggers memcg oom killer since 3.12, shouldn't
      require GFP_NOFS and so we shouldn't reintroduce the premature OOM
      killer issue which was originally addressed by the heuristic.
      
      As per David Chinner the xfs is doing similar thing since 2.6.15 already
      so ext4 is not the only affected filesystem.  Moreover he notes:
      
      : For example: IO completion might require unwritten extent conversion
      : which executes filesystem transactions and GFP_NOFS allocations. The
      : writeback flag on the pages can not be cleared until unwritten
      : extent conversion completes. Hence memory reclaim cannot wait on
      : page writeback to complete in GFP_NOFS context because it is not
      : safe to do so, memcg reclaim or otherwise.
      
      Cc: stable@vger.kernel.org # 3.9+
      [tytso@mit.edu: corrected the control flow]
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Reported-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecf5fc6e