1. 27 July 2016, 32 commits
    • mm: thp: check pmd_trans_unstable() after split_huge_pmd() · 337d9abf
      Committed by Naoya Horiguchi
      split_huge_pmd() doesn't guarantee that the pmd is a normal pmd pointing
      to pte entries, which can be checked with pmd_trans_unstable().  Some
      callers make this check, some do it differently, and some not at all, so
      let's do it in a unified manner.
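
      As a rough illustration of the unified caller pattern (not a hunk from
      this patch):

        split_huge_pmd(vma, pmd, addr);
        if (pmd_trans_unstable(pmd))
                return 0;       /* pmd was split or cleared under us; skip it */
        pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
        /* ... walk the pte entries ... */
        pte_unmap_unlock(pte, ptl);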
      
      Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      337d9abf
    • mm/page_isolation: clean up confused code · e3a2713c
      Committed by Joonsoo Kim
      When there is an isolated_page, post_alloc_hook() is called with page
      but __free_pages() is called with isolated_page.  Since they are the
      same page there is no problem, but it is very confusing.  To reduce the
      confusion, this patch changes isolated_page to a boolean and uses the
      page variable consistently.
      
      Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3a2713c
    • mm/page_alloc: introduce post allocation processing on page allocator · 46f24fd8
      Committed by Joonsoo Kim
      This patch is motivated from Hugh and Vlastimil's concern [1].
      
      There are two ways to get a freepage from the allocator.  One is the
      normal memory allocation API and the other is __isolate_free_page(),
      which is used internally for compaction and pageblock isolation.  The
      latter is rather tricky since it doesn't do the whole post-allocation
      processing done by the normal API.

      One problem I already know of is that a poisoned page would not be
      checked if it is allocated by __isolate_free_page().  Perhaps there are
      more.

      We could add more debug logic for allocated pages in the future, and
      this separation would cause more problems.  I'd like to fix the
      situation now.  The solution is simple: this patch commonizes the logic
      for newly allocated pages and uses it at all sites, which solves the
      problem.
      
      [1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E
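
      Roughly, the new hook gathers the per-allocation debug work in one place
      so both paths share it (a sketch of its shape; details may differ from
      the final patch):

        /* mm/internal.h */
        extern void post_alloc_hook(struct page *page, unsigned int order,
                                    gfp_t gfp_flags);

        /* mm/page_alloc.c: shared by the normal allocation path and by
         * __isolate_free_page() users such as compaction */
        void post_alloc_hook(struct page *page, unsigned int order,
                             gfp_t gfp_flags)
        {
                set_page_private(page, 0);
                set_page_refcounted(page);

                arch_alloc_page(page, order);
                kernel_map_pages(page, 1 << order, 1);
                kernel_poison_pages(page, 1 << order, 1);  /* poison check */
                kasan_alloc_pages(page, order);
                set_page_owner(page, order, gfp_flags);
        }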
      
      [iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46f24fd8
    • mm/page_owner: use stackdepot to store stacktrace · f2ca0b55
      Committed by Joonsoo Kim
      Currently, we store each page's allocation stacktrace in the
      corresponding page_ext structure, and it requires a lot of memory.  This
      causes the problem that a memory-tight system doesn't work well if
      page_owner is enabled.  Moreover, even with this large memory
      consumption, we cannot get a full stacktrace because we allocate memory
      at boot time and maintain just 8 stacktrace slots to balance memory
      consumption.  We could increase that, but it would make the system
      unusable or change system behaviour.

      To solve the problem, this patch uses stackdepot to store the
      stacktrace.  It obviously provides memory savings, but there is a
      drawback: stackdepot could fail.

      stackdepot allocates memory at runtime, so it could fail if the system
      does not have enough memory.  But most allocation stacks are generated
      very early, when there is plenty of memory, so failure would not happen
      easily.  And one failure means that we miss just one page's allocation
      stacktrace, so it is not a big problem.  In this patch, when a memory
      allocation failure happens, we store a special stacktrace handle for the
      page whose stacktrace could not be saved.  With it, the user can still
      estimate memory usage properly even if a failure happens.
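
      A minimal sketch of how a stacktrace is captured and handed to
      stackdepot (assuming the <linux/stackdepot.h> API; error handling is
      simplified):

        #include <linux/stacktrace.h>
        #include <linux/stackdepot.h>

        static depot_stack_handle_t save_stack(gfp_t flags)
        {
                unsigned long entries[64];
                struct stack_trace trace = {
                        .entries        = entries,
                        .max_entries    = ARRAY_SIZE(entries),
                        .skip           = 2,
                };

                save_stack_trace(&trace);
                /*
                 * depot_save_stack() may allocate and thus fail under memory
                 * pressure; the caller then falls back to a dummy "failure"
                 * handle so the user can still account for the page.
                 */
                return depot_save_stack(&trace, flags);
        }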
      
      The memory savings are as follows (4GB memory system with page_owner)
      (before the patch -> after the patch)
      
      static allocation:
      92274688 bytes -> 25165824 bytes
      
      dynamic allocation after boot + kernel build:
      0 bytes -> 327680 bytes
      
      total:
      92274688 bytes -> 25493504 bytes
      
      72% reduction in total.
      
      Note that the implementation looks more complex than one might imagine
      because there is a recursion issue.  stackdepot uses the page allocator,
      and page_owner is called at page allocation.  Using stackdepot in
      page_owner could re-enter the page allocator and then page_owner again;
      that is a recursion.  To detect and avoid it, whenever we obtain a
      stacktrace, recursion is checked and page_owner is set to dummy
      information if it is found.  The dummy information means that this page
      was allocated for the page_owner feature itself (such as for
      stackdepot), which is understandable behaviour for the user.
      
      [iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3]
        Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
        Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com
      Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2ca0b55
    • mm/page_owner: introduce split_page_owner and replace manual handling · a9627bc5
      Committed by Joonsoo Kim
      split_page() calls set_page_owner() to set up page_owner for each page.
      But it has a drawback: the head page and the others get different
      stacktraces because the callsite of set_page_owner() is slightly
      different.  To avoid this problem, this patch copies the head page's
      page_owner to the others.  It requires introducing a new function,
      split_page_owner(), but it also removes another function,
      get_page_owner_gfp(), so it looks like a reasonable trade.
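
      The caller side then collapses to a single call once the tail pages are
      set up; a rough sketch of split_page() after the change (details
      simplified):

        void split_page(struct page *page, unsigned int order)
        {
                int i;

                for (i = 1; i < (1 << order); i++)
                        set_page_refcounted(page + i);
                /* copy the head page's page_owner info to every tail page */
                split_page_owner(page, order);
        }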
      
      Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9627bc5
    • mm/page_owner: copy last_migrate_reason in copy_page_owner() · a8efe1c9
      Committed by Joonsoo Kim
      Currently, copy_page_owner() doesn't copy all the owner information.  It
      skips last_migrate_reason because copy_page_owner() is used for
      migration and it will be properly set soon.  But a following patch will
      use copy_page_owner() in another context, and this skip would leave the
      allocated page with an uninitialized last_migrate_reason.  To prevent
      that, this patch also copies last_migrate_reason in copy_page_owner().
      
      Link: http://lkml.kernel.org/r/1464230275-25791-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8efe1c9
    • mm/page_owner: initialize page owner without holding the zone lock · 83358ece
      Committed by Joonsoo Kim
      It's not necessary to initialize page_owner while holding the zone lock.
      Doing so causes more contention on the zone lock, although it is not a
      big problem since this is just a debug feature.  Still, it is an
      improvement, so do it.  This is also a preparation step for using
      stackdepot in the page_owner feature: stackdepot allocates new pages
      when there is no reserved space, and holding the zone lock in that case
      would cause a deadlock.
      
      Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83358ece
    • mm/compaction: split freepages without holding the zone lock · 66c64223
      Committed by Joonsoo Kim
      We don't need to split freepages while holding the zone lock.  It causes
      more contention on the zone lock, which is not desirable.
      
      [rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
      Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66c64223
    • zsmalloc: use OBJ_TAG_BIT for bit shifter · 3b1d9ca6
      Committed by Minchan Kim
      A static checker warns about using a tag as a bit shifter.  It doesn't
      break anything currently, but it is not good for readability.  Let's use
      OBJ_TAG_BIT as the bit shifter instead of OBJ_ALLOCATED_TAG.
      
      Link: http://lkml.kernel.org/r/20160607045146.GF26230@bbox
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b1d9ca6
    • zsmalloc: page migration support · 48b4800a
      Committed by Minchan Kim
      This patch introduces run-time migration feature for zspage.
      
      For migration, the VM uses the page.lru field, so it would be better not
      to use the page.next field, which is unified with page.lru, for our own
      purpose.  To that end, we can first get the first object offset of the
      page via runtime calculation instead of using page.index, which frees
      page.index to serve as the link for page chaining instead of page.next.

      In the case of a huge object, page.index stores the handle instead of
      the next link of the page chain, because a huge object doesn't need a
      next link for page chaining.  So get_next_page() needs to identify huge
      objects in order to return NULL.  For that, this patch uses the
      PG_owner_priv_1 page flag.
      
      For migration, it supports three functions
      
      * zs_page_isolate
      
      It isolates a zspage which includes a subpage VM want to migrate from
      class so anyone cannot allocate new object from the zspage.
      
      We could try to isolate a zspage through any of its subpages, so a
      subsequent isolation attempt on another subpage of the zspage shouldn't
      fail.  For that, we introduce a zspage.isolated count.  With it,
      zs_page_isolate can tell whether the zspage is already isolated for
      migration, and if so, a subsequent isolation attempt can succeed without
      doing any further isolation work.
      
      * zs_page_migrate
      
      First of all, it holds the write side of zspage->lock to prevent
      migration of other subpages in the zspage.  Then, it locks all objects
      in the page the VM wants to migrate.  The reason we should lock all
      objects in the page is the race between zs_map_object and
      zs_page_migrate:
      
        zs_map_object				zs_page_migrate
      
        pin_tag(handle)
        obj = handle_to_obj(handle)
        obj_to_location(obj, &page, &obj_idx);
      
      					write_lock(&zspage->lock)
      					if (!trypin_tag(handle))
      						goto unpin_object
      
        zspage = get_zspage(page);
        read_lock(&zspage->lock);
      
      If zs_page_migrate didn't do the trypin_tag, zs_map_object's page could
      become stale due to migration and it would crash.

      If it locks all of the objects successfully, it copies the contents from
      the old page to the new one and finally creates a new zspage chain with
      the new page.  And if it is the last isolated subpage in the zspage, it
      puts the zspage back to its class.
      
      * zs_page_putback
      
      It returns an isolated zspage to the right fullness_group list if it
      fails to migrate a page.  If it finds a zspage that is ZS_EMPTY, it
      queues the zspage freeing to a workqueue.  See below about async zspage
      freeing.

      This patch introduces asynchronous zspage freeing.  The reason we need
      it is that we need the page lock to clear PG_movable, but unfortunately
      the zs_free path should be atomic, so the approach is to try to grab the
      page lock.  If it gets the page lock of all the pages successfully, it
      can free the zspage immediately.  Otherwise, it queues a free request
      and frees the zspage via a workqueue in process context.

      If zs_free finds that the zspage is isolated when it tries to free it,
      it delays the freeing until zs_page_putback finds it, which will finally
      free the zspage.

      In this patch, we expand the fullness lists from ZS_EMPTY to ZS_FULL.
      First of all, the ZS_EMPTY list is used for delayed freeing.  And with
      the added ZS_FULL list, whether a zspage is isolated or not can be
      identified via a list_empty(&zspage->list) test.
      
      [minchan@kernel.org: zsmalloc: keep first object offset in struct page]
        Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org
      [minchan@kernel.org: zsmalloc: zspage sanity check]
        Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox
      Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48b4800a
    • zsmalloc: use freeobj for index · bfd093f5
      Committed by Minchan Kim
      Zsmalloc stores the first free object's <PFN, obj_idx> position in
      freeobj in each zspage.  If we change that to an object index relative
      to first_page instead of a position, it makes page migration simpler
      because we don't need to correct the other entries of the linked list
      when a page is migrated out.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfd093f5
    • zsmalloc: separate free_zspage from putback_zspage · 4aa409ca
      Committed by Minchan Kim
      Currently, putback_zspage frees the zspage under class->lock if its
      fullness becomes ZS_EMPTY, but that makes it hard to implement the
      locking scheme needed for the new zspage migration.  So this patch
      separates free_zspage from putback_zspage and frees the zspage outside
      of class->lock, as preparation for zspage migration.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aa409ca
    • zsmalloc: introduce zspage structure · 3783689a
      Committed by Minchan Kim
      We have squeezed the metadata of a zspage into the first page's
      descriptor.  So, to get the metadata from a subpage, we have to get the
      first page first of all.  But that makes it troublesome to implement the
      page migration feature of zsmalloc, because any place that gets the
      first page from a subpage can race with first page migration.  IOW, the
      first page it got could be stale.  To prevent that, I tried several
      approaches, but they made the code complicated, so finally I concluded
      to separate the metadata from the first page.  Of course, it consumes
      more memory.  IOW, 16 bytes per zspage on 32-bit at the moment.  It
      means we lose 1% at *worst case* (40B/4096B), which I think is not bad
      at the cost of maintainability.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3783689a
    • zsmalloc: factor page chain functionality out · bdb0af7c
      Committed by Minchan Kim
      For page migration, we need to create page chain of zspage dynamically
      so this patch factors it out from alloc_zspage.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-8-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdb0af7c
    • zsmalloc: use accessor · 4f42047b
      Committed by Minchan Kim
      An upcoming patch will change how the zspage metadata is encoded, so for
      easier review this patch wraps the metadata-accessing code in accessor
      functions.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-7-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f42047b
    • zsmalloc: use bit_spin_lock · 1b8320b6
      Committed by Minchan Kim
      Use the kernel's standard bit spin-lock instead of the custom mess.  The
      old code even has a bug: it doesn't disable preemption.  The reason we
      have not had any problem is that it has only been used inside a
      preemption-disabled section, under the class->lock spinlock.  So there
      is no need to go to stable.
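
      After the conversion, the tag helpers become thin wrappers around the
      generic bit spin-lock primitives, roughly:

        static void pin_tag(unsigned long handle)
        {
                bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }

        static int trypin_tag(unsigned long handle)
        {
                return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }

        static void unpin_tag(unsigned long handle)
        {
                bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
        }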
      
      Link: http://lkml.kernel.org/r/1464736881-24886-6-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b8320b6
    • zsmalloc: keep max_object in size_class · 1fc6e27d
      Committed by Minchan Kim
      Every zspage in a size_class has the same maximum number of objects, so
      we can move that value into the size_class.
      
      Link: http://lkml.kernel.org/r/1464736881-24886-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fc6e27d
    • mm: balloon: use general non-lru movable page feature · b1123ea6
      Committed by Minchan Kim
      Now that the VM has a feature to migrate non-lru movable pages, the
      balloon doesn't need custom migration hooks in migrate.c and compaction.c.
      
      Instead, this patch implements the page->mapping->a_ops->
      {isolate|migrate|putback} functions.
      
      With that, we could remove hooks for ballooning in general migration
      functions and make balloon compaction simple.
      
      [akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
      Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1123ea6
    • mm: migrate: support non-lru movable page migration · bda807d4
      Committed by Minchan Kim
      We have allowed migration for only LRU pages until now, and that was
      enough to make high-order pages.  But recently, embedded systems (e.g.,
      webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
      and we have seen several reports about trouble with small high-order
      allocations.  To fix the problem there have been several efforts (e.g.,
      enhancing the compaction algorithm, SLUB fallback to 0-order pages,
      reserved memory, vmalloc and so on), but if there are lots of
      non-movable pages in the system, those solutions are void in the long
      run.

      So, this patch adds a facility to make non-movable pages movable.  For
      the feature, this patch introduces migration-related functions in
      address_space_operations as well as some page flags.
      
      If a driver wants to make its own pages movable, it should define three
      functions, which are function pointers in struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What the VM expects of a driver's isolate_page function is to return
      *true* if the driver isolates the page successfully.  On returning true,
      the VM marks the page PG_isolated, so concurrent isolation on several
      CPUs skips the page.  If a driver cannot isolate the page, it should
      return *false*.

      Once a page is successfully isolated, the VM uses the page.lru fields,
      so the driver shouldn't expect the values in those fields to be
      preserved.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, the VM calls the driver's migratepage with the isolated
      page.  The job of migratepage is to move the content of the old page to
      the new page and set up the fields of struct page newpage.  Keep in mind
      that you should indicate to the VM that the oldpage is no longer movable
      via __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and return 0.  If the driver cannot migrate the page at the
      moment, it can return -EAGAIN.  On -EAGAIN, the VM will retry page
      migration after a short time because the VM interprets -EAGAIN as a
      "temporary migration failure".  On returning any error other than
      -EAGAIN, the VM will give up the page migration without retrying this
      time.

      The driver shouldn't touch the page.lru field while the VM is using it
      in these functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on an isolated page, the VM should return the
      isolated page to the driver, so the VM calls the driver's putback_page
      with the page whose migration failed.  In this function, the driver
      should put the isolated page back into its own data structures.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Drivers should use the function below to make a page movable, under
      page_lock:

      	void __SetPageMovable(struct page *page, struct address_space *mapping)

      It takes an address_space argument to register the family of migration
      functions which will be called by the VM.  Strictly speaking, PG_movable
      is not a real flag of struct page.  Rather, the VM reuses the lower bits
      of page->mapping to represent it:

      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

      so drivers shouldn't access page->mapping directly.  Instead, they
      should use page_mapping(), which masks off the low two bits of
      page->mapping so it returns the right struct address_space.
      
      For testing whether a page is a non-lru movable page, the VM supplies
      the __PageMovable() function.  However, it is not guaranteed to identify
      a non-lru movable page because the page->mapping field is unified with
      other variables in struct page.  Also, if the driver releases the page
      after isolation by the VM, page->mapping does not have a stable value
      although it has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable).
      But __PageMovable() is a cheap way to tell whether a page is LRU or
      non-lru movable once the page has been isolated, because LRU pages can
      never have PAGE_MAPPING_MOVABLE in page->mapping.  It is also good for
      just peeking to test for non-lru movable pages before the more expensive
      check with lock_page() in pfn scanning to select a victim.

      For guaranteed identification of non-lru movable pages, the VM provides
      the PageMovable() function.  Unlike __PageMovable(), PageMovable()
      validates page->mapping and mapping->a_ops->isolate_page under
      lock_page.  The lock_page prevents sudden destruction of page->mapping.

      Drivers using __SetPageMovable() should clear the flag via
      __ClearPageMovable() under page_lock before releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, the VM marks an
      isolated page PG_isolated under lock_page.  So if a CPU encounters a
      PG_isolated non-lru movable page, it can skip it.  The driver doesn't
      need to manipulate the flag because the VM will set/clear it
      automatically.  Keep in mind that if a driver sees a PG_isolated page,
      it means the page has been isolated by the VM, so it shouldn't touch the
      page.lru field.  PG_isolated is aliased with the PG_reclaim flag, so
      drivers shouldn't use the flag for their own purposes.
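
      As a driver-side sketch of how the three hooks fit together (the mydrv_*
      names and the registration detail are illustrative, not from this patch;
      the hook signatures follow the description above):

        #include <linux/fs.h>
        #include <linux/migrate.h>
        #include <linux/page-flags.h>

        static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
        {
                /* detach the page from the driver's own lists; true on success */
                return true;
        }

        static int mydrv_migratepage(struct address_space *mapping,
                                     struct page *newpage, struct page *page,
                                     enum migrate_mode mode)
        {
                /* copy contents and private state from page to newpage ... */

                /* tell the VM the old page is no longer movable */
                __ClearPageMovable(page);
                return MIGRATEPAGE_SUCCESS;  /* or -EAGAIN for a temporary failure */
        }

        static void mydrv_putback_page(struct page *page)
        {
                /* migration failed: put the page back on the driver's lists */
        }

        static const struct address_space_operations mydrv_aops = {
                .isolate_page   = mydrv_isolate_page,
                .migratepage    = mydrv_migratepage,
                .putback_page   = mydrv_putback_page,
        };

        /* When the driver allocates a page it wants to be movable:
         *      lock_page(page);
         *      __SetPageMovable(page, mydrv_mapping);  // mapping uses mydrv_aops
         *      unlock_page(page);
         */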
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda807d4
    • mm: use put_page() to free page instead of putback_lru_page() · c6c919eb
      Committed by Minchan Kim
      Recently, I got many reports about performance degradation in embedded
      systems (Android mobile phones, webOS TVs and so on) and easy fork
      failures.

      The problem was fragmentation caused mainly by zram and GPU drivers.
      Under memory pressure, their pages were spread out over all pageblocks
      and could not be migrated by the current compaction algorithm, which
      supports only LRU pages.  In the end, compaction cannot work well, so
      the reclaimer shrinks all of the working set pages.  It made the system
      very slow and even made fork fail easily, since fork requires
      order-[2 or 3] allocations.

      The other pain point is that they cannot use the CMA memory space, so
      when an OOM kill happens, I can see many free pages in the CMA area,
      which is not memory efficient.  In our product, which has a big CMA
      area, zones are reclaimed too excessively to allocate GPU and zram pages
      although there is lots of free space in CMA, so the system easily
      becomes very slow.
      
      To solve these problem, this patch tries to add facility to migrate
      non-lru pages via introducing new functions and page flags to help
      migration.
      
      struct address_space_operations {
      	..
      	..
      	bool (*isolate_page)(struct page *, isolate_mode_t);
      	void (*putback_page)(struct page *);
      	..
      }
      
      new page flags
      
      	PG_movable
      	PG_isolated
      
      For details, please read description in "mm: migrate: support non-lru
      movable page migration".
      
      Originally, Gioh Kim had tried to support this feature, but he moved on,
      so I took over the work.  I took much code from his work and changed it
      a little, and Konstantin Khlebnikov helped Gioh a lot, so they both
      deserve much credit.

      And I should mention Chulmin, who has tested this patchset heavily, so I
      could find many bugs thanks to him.  :)
      
      Thanks, Gioh, Konstantin and Chulmin!
      
      This patchset consists of five parts.
      
      1. clean up migration
        mm: use put_page to free page instead of putback_lru_page
      
      2. add non-lru page migration feature
        mm: migrate: support non-lru movable page migration
      
      3. rework KVM memory-ballooning
        mm: balloon: use general non-lru movable page feature
      
      4. zsmalloc refactoring for preparing page migration
        zsmalloc: keep max_object in size_class
        zsmalloc: use bit_spin_lock
        zsmalloc: use accessor
        zsmalloc: factor page chain functionality out
        zsmalloc: introduce zspage structure
        zsmalloc: separate free_zspage from putback_zspage
        zsmalloc: use freeobj for index
      
      5. zsmalloc page migration
        zsmalloc: page migration support
        zram: use __GFP_MOVABLE for memory allocation
      
      This patch (of 12):
      
      Procedure of page migration is as follows:
      
      First of all, it should isolate a page from the LRU and try to migrate
      the page.  If that succeeds, it releases the page for freeing.
      Otherwise, it should put the page back on the LRU list.
      
      For LRU pages, we have used putback_lru_page for both freeing and
      putting back to the LRU list.  That is okay because put_page is aware of
      the LRU list, so if it releases the last refcount of the page, it
      removes the page from the LRU list.  However, it performs unnecessary
      operations (e.g., lru_cache_add, pagevec and flags operations; not
      significant, but not worth doing) and makes it harder to support new
      non-lru page migration because put_page isn't aware of non-lru pages'
      data structures.

      To solve the problem, we could add a new hook in put_page with a
      PageMovable flag check, but that would increase overhead in a hot path
      and need a new locking scheme to stabilize the flag check against
      put_page.

      So, this patch cleans it up to divide the two semantics (i.e., put and
      putback).  If migration is successful, use put_page instead of
      putback_lru_page, and use putback_lru_page only on failure.  That makes
      the code more readable and doesn't add overhead to put_page.
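
      The resulting shape of the migration completion path is roughly (a
      sketch, not the exact hunk):

        rc = __unmap_and_move(page, newpage, force, mode);

        if (rc == MIGRATEPAGE_SUCCESS) {
                /* migration done: just drop the source page's reference */
                put_page(page);
        } else if (rc != -EAGAIN) {
                /* permanent failure: return the page to the LRU */
                putback_lru_page(page);
        }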
      
      Comment from Vlastimil
       "Yeah, and compaction (perhaps also other migration users) has to drain
        the lru pvec...  Getting rid of this stuff is worth even by itself."
      
      Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6c919eb
    • mm: oom: add memcg to oom_control · 2a966b77
      Committed by Vladimir Davydov
      It's a part of oom context just like allocation order and nodemask, so
      let's move it to oom_control instead of passing it in the argument list.
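
      After this change the OOM context looks roughly like the following
      (field order and extras are approximate):

        struct oom_control {
                struct zonelist         *zonelist;
                nodemask_t              *nodemask;
                struct mem_cgroup       *memcg;         /* new: memcg OOM context */
                const gfp_t             gfp_mask;
                const int               order;
        };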
      
      Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a966b77
    • mm: zap ZONE_OOM_LOCKED · 798fd756
      Committed by Vladimir Davydov
      Not used since oom_lock was introduced.
      
      Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      798fd756
    • memory-hotplug: more general validation of zone during online · df429ac0
      Committed by Reza Arbab
      When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
      ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.
      
      To be more flexible, use the following criteria instead: to online
      memory from zone X into zone Y,
      
      * Any zones between X and Y must be unused.
      * If X is lower than Y, the onlined memory must lie at the end of X.
      * If X is higher than Y, the onlined memory must lie at the start of X.
      
      Add zone_can_shift() to make this determination.
      
      Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df429ac0
    • memory-hotplug: add move_pfn_range() · e51e6c8f
      Committed by Reza Arbab
      Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
      move_pfn_range_right().
      
      No functional change. This will be utilized by a later patch.
      
      Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
      Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrew Banman <abanman@sgi.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e51e6c8f
    • mm/init: fix zone boundary creation · 90cae1fe
      Committed by Oliver O'Halloran
      As a part of memory initialisation the architecture passes an array to
      free_area_init_nodes() which specifies the max PFN of each memory zone.
      This array is not necessarily monotonic (due to unused zones) so this
      array is parsed to build monotonic lists of the min and max PFN for each
      zone.  ZONE_MOVABLE is special cased here as its limits are managed by
      the mm subsystem rather than the architecture.  Unfortunately, this
      special casing is broken when ZONE_MOVABLE is not the last zone in
      the zone list.  The core of the issue is:
      
      	if (i == ZONE_MOVABLE)
      		continue;
      	arch_zone_lowest_possible_pfn[i] =
      		arch_zone_highest_possible_pfn[i-1];
      
      As ZONE_MOVABLE is skipped, the lowest_possible_pfn of the next zone
      will be set to zero.  This patch fixes the bug by explicitly tracking
      where the next zone should start rather than relying on the contents of
      arch_zone_highest_possible_pfn[].
      
      This is low priority.  To get bitten by this you need to enable a zone
      that appears after ZONE_MOVABLE in the zone_type enum.  As far as I can
      tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
      so I can't see this affecting too many people.

      I only noticed this because I've been fiddling with ZONE_DEVICE on
      powerpc and 4.6 broke my test kernel.  This bug, in conjunction with the
      changes in Taku Izumi's kernelcore=mirror patch (d91749c1) and
      powerpc being the odd architecture which initialises max_zone_pfn[] to
      ~0ul instead of 0, caused all of system memory to be placed into
      ZONE_DEVICE at boot, followed by a panic since device memory cannot be
      used for kernel allocations.  I've already submitted a patch to fix the
      powerpc-specific bits, but I figured this should be fixed too.
      
      Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90cae1fe
    • mm/memcontrol.c: remove the useless parameter for mc_handle_swap_pte · 48406ef8
      Committed by Li RongQing
      It seems like this parameter has never been used since being introduced
      by 90254a65 ("memcg: clean up move charge").  Not a big deal because
      I assume the function would get inlined into the caller anyway but why
      not get rid of it.
      
      [mhocko@suse.com: wrote changelog]
        Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
      Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48406ef8
    • mm/slab: use list_move instead of list_del/list_add · de24baec
      Committed by Wei Yongjun
      Using list_move() instead of list_del() + list_add() to avoid needlessly
      poisoning the next and prev values.
      
      Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de24baec
    • slab: do not panic on invalid gfp_mask · 72baeef0
      Committed by Michal Hocko
      Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
      This is a rather harsh way to announce a non-critical issue.  The
      allocator is free to ignore invalid flags.  Let's simply replace the
      BUG() with dump_stack() to tell the offender, and fix up the mask to
      move on with the allocation request.
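
      Roughly, the BUG() turns into a warn-and-fixup along these lines (a
      sketch, not the exact hunk):

        if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
                gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

                flags &= ~GFP_SLAB_BUG_MASK;
                pr_warn("Unexpected gfp: %#x (%pGg). Fixing up to gfp: %#x (%pGg). Fix your code!\n",
                        invalid_mask, &invalid_mask, flags, &flags);
                dump_stack();
        }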
      
      This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
      module:
      
        Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
        CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
          dump_stack+0x67/0x90
          cache_alloc_refill+0x201/0x617
          kmem_cache_alloc_trace+0xa7/0x24a
          ? 0xffffffffa0005000
          mymodule_init+0x20/0x1000 [test_slab]
          do_one_initcall+0xe7/0x16c
          ? rcu_read_lock_sched_held+0x61/0x69
          ? kmem_cache_alloc_trace+0x197/0x24a
          do_init_module+0x5f/0x1d9
          load_module+0x1a3d/0x1f21
          ? retint_kernel+0x2d/0x2d
          SyS_init_module+0xe8/0x10e
          ? SyS_init_module+0xe8/0x10e
          do_syscall_64+0x68/0x13f
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72baeef0
    • slab: make GFP_SLAB_BUG_MASK information more human readable · bacdcb34
      Committed by Michal Hocko
      printk has offered %pGg for quite some time, so let's use it to get a
      human-readable list of invalid flags.
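
      For reference, %pGg prints symbolic gfp flag names and takes a pointer
      to the gfp_t value, e.g. (illustrative):

        gfp_t invalid_mask = flags & GFP_SLAB_BUG_MASK;

        pr_warn("Unexpected gfp: %#x (%pGg)\n", invalid_mask, &invalid_mask);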
      
      The original output would be
        [  429.191962] gfp: 2
      
      after the change
        [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)
      
      Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bacdcb34
    • mm: SLUB freelist randomization · 210e7a43
      Committed by Thomas Garnier
      Implements freelist randomization for the SLUB allocator.  It was
      previously implemented for the SLAB allocator.  Both use the same
      configuration option (CONFIG_SLAB_FREELIST_RANDOM).
      
      The list is randomized during initialization of a new set of pages.  The
      order on different freelist sizes is pre-computed at boot for
      performance.  Each kmem_cache has its own randomized freelist.
      
      This security feature reduces the predictability of the kernel SLUB
      allocator against heap overflows, rendering attacks much less stable.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      Performance results:
      
      slab_test impact is between 3% to 4% on average for 100000 attempts
      without smp.  It is a very focused testing, kernbench show the overall
      impact on the system is way lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      210e7a43
    • mm: reorganize SLAB freelist randomization · 7c00fce9
      Committed by Thomas Garnier
      The kernel heap allocators use a sequential freelist, making their
      allocations predictable.  This predictability makes kernel heap
      overflows easier to exploit.  An attacker can carefully prepare the
      kernel heap to control which chunk follows the one being overflowed.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      ***Problems that needed solving:
       - Randomize the Freelist (singled linked) used in the SLUB allocator.
       - Ensure good performance to encourage usage.
       - Get best entropy in early boot stage.
      
      ***Parts:
       - 01/02 Reorganize the SLAB Freelist randomization to share elements
         with the SLUB implementation.
       - 02/02 The SLUB Freelist randomization implementation. Similar approach
         than the SLAB but tailored to the singled freelist used in SLUB.
      
      ***Performance data:
      
      slab_test impact is between 3% to 4% on average for 100000 attempts
      without smp.  It is a very focused testing, kernbench show the overall
      impact on the system is way lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      This patch (of 2):
      
      This commit reorganizes the previous SLAB freelist randomization to
      prepare for the SLUB implementation.  It moves functions that will be
      shared to slab_common.
      
      The entropy functions are changed to align with the SLUB implementation,
      now using get_random_(int|long) functions.  These functions were chosen
      because they provide a bit more entropy early on boot and better
      performance when specific arch instructions are not available.
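
      Conceptually, the shared pre-computation is a Fisher-Yates shuffle of an
      index array seeded from those entropy sources (an illustrative sketch,
      not the exact helper added by the patch):

        #include <linux/random.h>

        /* shuffle the pre-computed object index sequence for one cache */
        static void shuffle_freelist_index(unsigned int *list, unsigned int count)
        {
                unsigned int i, pos, tmp;

                for (i = count - 1; i > 0; i--) {
                        pos = get_random_int() % (i + 1);
                        tmp = list[i];
                        list[i] = list[pos];
                        list[pos] = tmp;
                }
        }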
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c00fce9
    • fs/fs-writeback.c: add a new writeback list for sync · 6c60d2b5
      Committed by Dave Chinner
      wait_sb_inodes() currently does a walk of all inodes in the filesystem
      to find dirty ones to wait on during sync.  This is highly inefficient
      and wastes a lot of CPU when there are lots of clean cached inodes that
      we don't need to wait on.
      
      To avoid this "all inode" walk, we need to track inodes that are
      currently under writeback that we need to wait for.  We do this by
      adding inodes to a writeback list on the sb when the mapping is first
      tagged as having pages under writeback.  wait_sb_inodes() can then walk
      this list of "inodes under IO" and wait specifically just for the inodes
      that the current sync(2) needs to wait for.
      
      Define a couple helpers to add/remove an inode from the writeback list
      and call them when the overall mapping is tagged for or cleared from
      writeback.  Update wait_sb_inodes() to walk only the inodes under
      writeback due to the sync.
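
      A rough sketch of the add-side helper, assuming the new list and lock
      live on the super_block (the field and helper names here are an
      approximation of the changelog's description):

        void sb_mark_inode_writeback(struct inode *inode)
        {
                struct super_block *sb = inode->i_sb;
                unsigned long flags;

                /* cheap unlocked check first; most inodes are already listed */
                if (list_empty(&inode->i_wb_list)) {
                        spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
                        if (list_empty(&inode->i_wb_list))
                                list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
                        spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
                }
        }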
      
      With this change, filesystem sync times are significantly reduced for
      fs' with largely populated inode caches and otherwise no other work to
      do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
      with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
      than 0.1s when the filesystem is fully clean.
      
      Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c60d2b5
  2. 23 July 2016, 1 commit
    • mm: memcontrol: fix cgroup creation failure after many small jobs · 73f576c0
      Committed by Johannes Weiner
      The memory controller has quite a bit of state that usually outlives the
      cgroup and pins its CSS until said state disappears.  At the same time
      it imposes a 16-bit limit on the CSS ID space to economically store IDs
      in the wild.  Consequently, when we use cgroups to contain frequent but
      small and short-lived jobs that leave behind some page cache, we quickly
      run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
      fails with -ENOSPC while there are only a few, or even no user-visible
      cgroups in existence.
      
      Although pinning CSSs past cgroup removal is common, there are only two
      instances that actually need an ID after a cgroup is deleted: cache
      shadow entries and swapout records.
      
      Cache shadow entries reference the ID weakly and can deal with the CSS
      having disappeared when it's looked up later.  They pose no hurdle.
      
      Swap-out records do need to pin the css to hierarchically attribute
      swapins after the cgroup has been deleted; though the only pages that
      remain swapped out after offlining are tmpfs/shmem pages.  And those
      references are under the user's control, so they are manageable.
      
      This patch introduces a private 16-bit memcg ID and switches swap and
      cache shadow entries over to using that.  This ID can then be recycled
      after offlining when the CSS remains pinned only by objects that don't
      specifically need it.
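
      A rough sketch of the idea, using an IDR capped to the 16-bit range (the
      helper names below are illustrative, not the patch's exact ones):

        #include <linux/idr.h>

        #define MEM_CGROUP_ID_MAX       USHRT_MAX

        static DEFINE_IDR(mem_cgroup_idr);

        static int mem_cgroup_alloc_id(struct mem_cgroup *memcg)
        {
                /* 16-bit ID space, recycled independently of the CSS ID */
                return idr_alloc(&mem_cgroup_idr, memcg, 1,
                                 MEM_CGROUP_ID_MAX, GFP_KERNEL);
        }

        static void mem_cgroup_free_id(unsigned short id)
        {
                /* called once swap and shadow-entry references are gone */
                idr_remove(&mem_cgroup_idr, id);
        }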
      
      This script demonstrates the problem by faulting one cache page in a new
      cgroup and deleting it again:
      
        set -e
        mkdir -p pages
        for x in `seq 128000`; do
          [ $((x % 1000)) -eq 0 ] && echo $x
          mkdir /cgroup/foo
          echo $$ >/cgroup/foo/cgroup.procs
          echo trex >pages/$x
          echo $$ >/cgroup/cgroup.procs
          rmdir /cgroup/foo
        done
      
      When run on an unpatched kernel, we eventually run out of possible IDs
      even though there are no visible cgroups:
      
        [root@ham ~]# ./cssidstress.sh
        [...]
        65000
        mkdir: cannot create directory '/cgroup/foo': No space left on device
      
      After this patch, the IDs get released upon cgroup destruction and the
      cache and css objects get released once memory reclaim kicks in.
      
      [hannes@cmpxchg.org: init the IDR]
        Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
      Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups")
      Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: John Garcia <john.garcia@mesosphere.io>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: <stable@vger.kernel.org>	[3.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73f576c0
  3. 15 July 2016, 7 commits