1. 07 Jul, 2017 (1 commit)
  2. 09 May, 2017 (5 commits)
  3. 04 May, 2017 (1 commit)
  4. 02 Mar, 2017 (1 commit)
  5. 25 Feb, 2017 (1 commit)
  6. 23 Feb, 2017 (2 commits)
  7. 15 Dec, 2016 (1 commit)
    • mm, compaction: allow compaction for GFP_NOFS requests · 73e64c51
      Authored by Michal Hocko
      Compaction has been disabled for GFP_NOFS and GFP_NOIO requests since
      direct compaction was introduced by commit 56de7263 ("mm:
      compaction: direct compact when a high-order allocation fails").  The
      main reason is that migration of page cache pages might recurse back
      into the fs/io layer and we could potentially deadlock.  This is overly
      conservative, because all anonymous memory is migratable in the
      GFP_NOFS context just fine, and that can be a large portion of the
      memory in many or most workloads.
      
      Remove the GFP_NOFS restriction and make sure that we skip all fs pages
      (those with a mapping) while isolating pages to be migrated.  Even
      clean fs pages cannot be considered, because they might need a metadata
      update, so only pages without any mapping are isolated for nofs
      requests.
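
      A minimal sketch of the isolation-time check this describes, assuming
      the cc->gfp_mask plumbing of isolate_migratepages_block(); the exact
      placement in the scan loop is simplified:

      	/* Sketch: inside the isolate_migratepages_block() scan loop.
      	 * For !__GFP_FS requests only pages without a mapping, i.e.
      	 * anonymous pages not yet in the swap cache, are safe to
      	 * migrate, since touching fs pages could recurse into the
      	 * filesystem. */
      	if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
      		goto isolate_fail;	/* skip pages with a mapping for nofs */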
      
      The effect of this patch will probably be very limited in many or most
      workloads, because higher-order GFP_NOFS requests are quite rare,
      although different configurations might lead to very different results.
      David Chinner has mentioned a heavy metadata workload with 64kB
      directory blocks, which, to quote him:
      
      : Unfortunately, there was an era of cargo cult configuration tweaks in the
      : Ceph community that has resulted in a large number of production machines
      : with XFS filesystems configured this way.  And a lot of them store large
      : numbers of small files and run under significant sustained memory
      : pressure.
      :
      : I'm slowly working towards getting rid of these high order allocations and
      : replacing them with the equivalent number of single page allocations, but
      : I haven't got that (complex) change working yet.
      
      We can do the following to simulate that workload:
      $ mkfs.xfs -f -n size=64k <dev>
      $ mount <dev> /mnt/scratch
      $ time ./fs_mark  -D  10000  -S0  -n  100000  -s  0  -L  32 \
              -d  /mnt/scratch/0  -d  /mnt/scratch/1 \
              -d  /mnt/scratch/2  -d  /mnt/scratch/3 \
              -d  /mnt/scratch/4  -d  /mnt/scratch/5 \
              -d  /mnt/scratch/6  -d  /mnt/scratch/7 \
              -d  /mnt/scratch/8  -d  /mnt/scratch/9 \
              -d  /mnt/scratch/10  -d  /mnt/scratch/11 \
              -d  /mnt/scratch/12  -d  /mnt/scratch/13 \
              -d  /mnt/scratch/14  -d  /mnt/scratch/15
      
      and indeed it hammers the system with many high-order GFP_NOFS requests,
      as a simple tracepoint filter during the load shows:
      $ echo '!(gfp_flags & 0x80) && (gfp_flags & 0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter
      (i.e. allocations without __GFP_FS that are allowed to direct reclaim).
      I am getting:
      5287609 order=0
           37 order=1
      1594905 order=2
      3048439 order=3
      6699207 order=4
        66645 order=5
      
      My testing was done in a kvm guest, so the performance numbers should be
      taken with a grain of salt, but there seems to be a difference when the
      patch is applied:
      
      * Original kernel
      FSUse%        Count         Size    Files/sec     App Overhead
           1      1600000            0       4300.1         20745838
           3      3200000            0       4239.9         23849857
           5      4800000            0       4243.4         25939543
           6      6400000            0       4248.4         19514050
           8      8000000            0       4262.1         20796169
           9      9600000            0       4257.6         21288675
          11     11200000            0       4259.7         19375120
          13     12800000            0       4220.7         22734141
          14     14400000            0       4238.5         31936458
          16     16000000            0       4231.5         23409901
          18     17600000            0       4045.3         23577700
          19     19200000            0       2783.4         58299526
          21     20800000            0       2678.2         40616302
          23     22400000            0       2693.5         83973996
      
      and xfs complains about memory allocations not making progress:
      [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
      [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
      [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240)
      [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240)
      [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
      
      * Patched kernel
      FSUse%        Count         Size    Files/sec     App Overhead
           1      1600000            0       4289.1         19243934
           3      3200000            0       4241.6         32828865
           5      4800000            0       4248.7         32884693
           6      6400000            0       4314.4         19608921
           8      8000000            0       4269.9         24953292
           9      9600000            0       4270.7         33235572
          11     11200000            0       4346.4         40817101
          13     12800000            0       4285.3         29972397
          14     14400000            0       4297.2         20539765
          16     16000000            0       4219.6         18596767
          18     17600000            0       4273.8         49611187
          19     19200000            0       4300.4         27944451
          21     20800000            0       4270.6         22324585
          22     22400000            0       4317.6         22650382
          24     24000000            0       4065.2         22297964
      
      So the drop-off at Count 19200000 didn't happen, and there was only a
      single warning about an allocation not making progress:
      [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
      
      This suggests that the patch has helped even though there is not all
      that much anonymous memory, as the workload mostly generates fs
      metadata.  I assume the success rate would be higher with more anonymous
      memory, which should be the case in many workloads.
      
      [akpm@linux-foundation.org: fix comment]
      Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 13 Dec, 2016 (1 commit)
    • mm, compaction: fix NR_ISOLATED_* stats for pfn based migration · 6afcf8ef
      Authored by Ming Ling
      Since commit bda807d4 ("mm: migrate: support non-lru movable page
      migration"), isolate_migratepages_block() can isolate !PageLRU pages,
      which acct_isolated() then accounts as NR_ISOLATED_*.  Accounting these
      non-lru pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense, and it
      can mislead heuristics based on those counters, such as
      pgdat_reclaimable_pages and too_many_isolated, which would lead to
      unexpected stalls during direct reclaim without any good reason.  Note
      that __alloc_contig_migrate_range can isolate a lot of pages at once.
      
      On mobile devices such as an Android phone with 512MB of RAM, a big
      zram swap may be in use.  In some cases zram (zsmalloc) uses too many
      non-lru but migratable pages, such as:
      
            MemTotal:        468148 kB
            Normal free:       5620 kB
            Free swap:         4736 kB
            Total swap:      409596 kB
            ZRAM:            164616 kB (zsmalloc non-lru pages)
            active_anon:      60700 kB
            inactive_anon:    60744 kB
            active_file:      34420 kB
            inactive_file:    37532 kB
      
      Fix this by accounting only lru pages to NR_ISOLATED_* in
      isolate_migratepages_block, right after they are isolated, while we
      still know they were on an LRU.  Drop acct_isolated because it is
      called after the fact, by which point that information is lost.
      Batching the per-cpu counter doesn't bring much improvement anyway.
      Also make sure that we uncharge only LRU pages when putting them back
      on the LRU in putback_movable_pages, and likewise when unmap_and_move
      migrates the page.
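
      A sketch of the two halves of the fix (helper names as in kernels of
      this era; exact placement simplified):

      	/* In isolate_migratepages_block(), right after a successful LRU
      	 * isolation, while we still know the page came off an LRU: */
      	inc_node_page_state(page,
      			NR_ISOLATED_ANON + page_is_file_cache(page));

      	/* When putting pages back (putback_movable_pages()) or after a
      	 * successful migration, only uncharge former LRU pages;
      	 * non-lru movable pages were never accounted: */
      	if (likely(!__PageMovable(page)))
      		dec_node_page_state(page,
      				NR_ISOLATED_ANON + page_is_file_cache(page));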
      
      [mhocko@suse.com: replace acct_isolated() with direct counting]
      Fixes: bda807d4 ("mm: migrate: support non-lru movable page migration")
      Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
      Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 02 Dec, 2016 (1 commit)
  10. 08 Oct, 2016 (12 commits)
  11. 29 Jul, 2016 (8 commits)
  12. 27 Jul, 2016 (6 commits)
    • mm/page_alloc: introduce post allocation processing on page allocator · 46f24fd8
      Authored by Joonsoo Kim
      This patch is motivated by Hugh and Vlastimil's concern [1].
      
      There are two ways to get a freepage from the allocator.  One is the
      normal memory allocation API; the other is __isolate_free_page(), which
      is used internally for compaction and pageblock isolation.  The latter
      is rather tricky since it doesn't do the whole post-allocation
      processing that the normal API does.

      One problem I already know of is that a poisoned page is not checked
      when it is allocated by __isolate_free_page().  Perhaps there are more.

      We may add more debug logic for allocated pages in the future, and this
      separation would cause more problems.  I'd like to fix this situation
      now.  The solution is simple: this patch commonizes the logic for newly
      allocated pages and uses it at all sites.  This solves the problem.
      
      [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E
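
      A sketch of the commonized hook, mirroring what the patch calls
      post_alloc_hook() and calling it from both the normal allocation path
      and the __isolate_free_page() users (contents per kernels of this era,
      slightly abridged):

      	void post_alloc_hook(struct page *page, unsigned int order,
      						gfp_t gfp_flags)
      	{
      		set_page_private(page, 0);
      		set_page_refcounted(page);

      		arch_alloc_page(page, order);
      		kernel_map_pages(page, 1 << order, 1);
      		kernel_poison_pages(page, 1 << order, 1); /* poison check now runs */
      		kasan_alloc_pages(page, order);
      		set_page_owner(page, order, gfp_flags);
      	}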
      
      [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_owner: initialize page owner without holding the zone lock · 83358ece
      Authored by Joonsoo Kim
      It's not necessary to initialize page_owner while holding the zone
      lock.  Doing so causes more contention on the zone lock; that is not a
      big problem since this is just a debug feature, but it is still better
      to avoid it.  This is also a preparation step for using stackdepot in
      the page owner feature: stackdepot allocates new pages when there is no
      reserved space, and holding the zone lock in that case would cause a
      deadlock.
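
      A sketch of the reordering at a hypothetical __isolate_free_page()
      call site; the only point is that set_page_owner() runs after the
      lock is dropped:

      	spin_lock_irqsave(&zone->lock, flags);
      	isolated = __isolate_free_page(page, order); /* no set_page_owner() here */
      	spin_unlock_irqrestore(&zone->lock, flags);

      	if (isolated)
      		/* may allocate (stackdepot), so must not run under zone->lock */
      		set_page_owner(page, order, __GFP_MOVABLE);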
      
      Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: split freepages without holding the zone lock · 66c64223
      Authored by Joonsoo Kim
      We don't need to split freepages while holding the zone lock; doing so
      causes more contention on the zone lock, which is not desirable.
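
      A sketch of the resulting pattern, roughly as in compaction's
      map_pages() after this series (post_alloc_hook() is from the related
      patch above): isolation under zone->lock keeps each free page intact
      at its order, and splitting happens later without the lock:

      	static void map_pages(struct list_head *list)
      	{
      		unsigned int i, order, nr_pages;
      		struct page *page, *next;
      		LIST_HEAD(tmp_list);

      		list_for_each_entry_safe(page, next, list, lru) {
      			list_del(&page->lru);

      			order = page_private(page);	/* stashed at isolation */
      			nr_pages = 1 << order;

      			post_alloc_hook(page, order, __GFP_MOVABLE);
      			if (order)
      				split_page(page, order);

      			for (i = 0; i < nr_pages; i++) {
      				list_add(&page->lru, &tmp_list);
      				page++;
      			}
      		}
      		list_splice(&tmp_list, list);
      	}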
      
      [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: introduce zspage structure · 3783689a
      Authored by Minchan Kim
      We have squeezed the metadata of a zspage into the first page's
      descriptor, so to get the metadata from a subpage we must find the
      first page first of all.  But this makes it troublesome to implement
      the page migration feature of zsmalloc, because any place that derives
      the first page from a subpage can race with migration of the first
      page.  IOW, the first page it got could be stale.  To prevent that, I
      tried several approaches, but they made the code complicated, so
      finally I concluded to separate the metadata from the first page.  Of
      course, this consumes more memory: 16 bytes per zspage on 32-bit at
      the moment.  It means we lose 1% in the *worst case* (40B/4096B),
      which I think is a fair price for maintainability.
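
      The separated metadata looks roughly like this (fields per the patch;
      bit widths elided):

      	struct zspage {
      		struct {
      			unsigned int fullness:FULLNESS_BITS;
      			unsigned int class:CLASS_BITS;
      		};
      		unsigned int inuse;		/* objects in use */
      		unsigned int freeobj;		/* head of free-object list */
      		struct page *first_page;	/* no longer found via subpages */
      		struct list_head list;		/* fullness list */
      	};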
      
      Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: balloon: use general non-lru movable page feature · b1123ea6
      Authored by Minchan Kim
      Now that the VM has a feature to migrate non-lru movable pages, the
      balloon driver doesn't need custom migration hooks in migrate.c and
      compaction.c.

      Instead, this patch implements the page->mapping->a_ops->
      {isolate|migrate|putback} functions.

      With that, we can remove the ballooning hooks from the general
      migration functions and make balloon compaction simple.
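
      The resulting wiring is roughly the following (callback names per this
      series; the balloon_page_* implementations live in
      mm/balloon_compaction.c):

      	const struct address_space_operations balloon_aops = {
      		.migratepage = balloon_page_migrate,
      		.isolate_page = balloon_page_isolate,
      		.putback_page = balloon_page_putback,
      	};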
      
      [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
      Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: support non-lru movable page migration · bda807d4
      Authored by Minchan Kim
      Until now we have allowed migration only for LRU pages, and that was
      enough to make high-order pages.  But recently, embedded systems
      (e.g., webOS, Android) use lots of non-movable pages (e.g., zram, GPU
      memory), so we have seen several reports about trouble with small
      high-order allocations.  There have been several efforts to fix the
      problem (e.g., enhancing the compaction algorithm, SLUB fallback to
      order-0 pages, reserved memory, vmalloc and so on), but if there are
      lots of non-movable pages in the system, those solutions are void in
      the long run.
      
      So this patch adds a facility to turn non-movable pages into movable
      ones.  For this feature, it introduces migration-related functions in
      address_space_operations as well as some page flags.

      If a driver wants to make its own pages movable, it should define
      three functions, which are function pointers of struct
      address_space_operations; a combined driver sketch follows the
      descriptions below.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      The VM expects the driver's isolate_page function to return *true* if
      the driver isolated the page successfully.  On returning true, the VM
      marks the page as PG_isolated so that concurrent isolation on other
      CPUs skips the page.  If a driver cannot isolate the page, it should
      return *false*.

      Once a page is successfully isolated, the VM uses the page.lru fields,
      so the driver shouldn't expect the values in those fields to be
      preserved.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, the VM calls the driver's migratepage with the
      isolated page.  The job of migratepage is to move the content of the
      old page to the new page and to set up the fields of struct page
      newpage.  Keep in mind that you should indicate to the VM that the
      oldpage is no longer movable, via __ClearPageMovable() under page_lock,
      if you migrated the oldpage successfully and return 0.  If the driver
      cannot migrate the page at the moment, it can return -EAGAIN.  On
      -EAGAIN, the VM will retry page migration after a short time, because
      it interprets -EAGAIN as a temporary migration failure.  On returning
      any error other than -EAGAIN, the VM gives up on migrating the page
      without retrying.

      The driver shouldn't touch the page.lru field, which the VM is using,
      in these functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on an isolated page, the VM should return the
      isolated page to the driver, so it calls the driver's putback_page
      with the page whose migration failed.  In this function, the driver
      should put the isolated page back into its own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable pages.
      
      * PG_movable
      
      The driver should use the function below to make a page movable, under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It takes an address_space argument for registering the family of
      migration functions that will be called by the VM.  Strictly speaking,
      PG_movable is not a real flag of struct page; rather, the VM reuses
      the lower bits of page->mapping to represent it:

      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

      so the driver shouldn't access page->mapping directly.  Instead, it
      should use page_mapping, which masks off the low two bits of
      page->mapping so it can get the right struct address_space.
      
      For testing non-lru movable pages, the VM supports the __PageMovable
      function.  However, it doesn't guarantee identification of a non-lru
      movable page, because the page->mapping field is unified with other
      variables in struct page.  Also, if the driver releases the page after
      isolation by the VM, page->mapping doesn't have a stable value even
      though it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
      But __PageMovable is a cheap way to tell whether a page is LRU or
      non-lru movable once the page has been isolated, because LRU pages can
      never have PAGE_MAPPING_MOVABLE in page->mapping.  It is also good for
      just peeking at pages to test for non-lru movability before the more
      expensive check with lock_page in the pfn scan that selects victims.

      For a guarantee of non-lru movability, the VM provides the PageMovable
      function.  Unlike __PageMovable, PageMovable validates page->mapping
      and mapping->a_ops->isolate_page under lock_page.  The lock_page
      prevents sudden destruction of page->mapping.
      
      A driver using __SetPageMovable should clear the flag via
      __ClearPageMovable under page_lock before releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, the VM marks an
      isolated page as PG_isolated under lock_page.  So if a CPU encounters
      a PG_isolated non-lru movable page, it can skip it.  The driver doesn't
      need to manipulate the flag, because the VM sets and clears it
      automatically.  Keep in mind that if the driver sees a PG_isolated
      page, the page has been isolated by the VM, so the driver shouldn't
      touch the page.lru field.  PG_isolated is aliased with the PG_reclaim
      flag, so the driver shouldn't use that flag for its own purposes.
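
      Putting the pieces together, a hypothetical driver skeleton might look
      like this (the mydrv_* names are illustrative; the a_ops fields and
      page-flag helpers are the ones introduced by this patch):

      	static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
      	{
      		/* Detach the page from the driver's own lists; on true
      		 * the VM marks it PG_isolated and may reuse page->lru. */
      		return true;
      	}

      	static int mydrv_migratepage(struct address_space *mapping,
      			struct page *newpage, struct page *page,
      			enum migrate_mode mode)
      	{
      		/* Copy contents and switch the driver's bookkeeping from
      		 * page to newpage, then retire the old page. */
      		__ClearPageMovable(page);	/* under page_lock */
      		return 0;	/* or -EAGAIN for a temporary failure */
      	}

      	static void mydrv_putback_page(struct page *page)
      	{
      		/* Migration failed: reinsert the isolated page into the
      		 * driver's own data structures. */
      	}

      	static const struct address_space_operations mydrv_aops = {
      		.isolate_page	= mydrv_isolate_page,
      		.migratepage	= mydrv_migratepage,
      		.putback_page	= mydrv_putback_page,
      	};

      	/* When a page becomes driver-owned and movable, under page_lock: */
      	__SetPageMovable(page, mapping);	/* mapping->a_ops == &mydrv_aops */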
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>