1. 15 May, 2015 3 commits
    • M
      mm, numa: really disable NUMA balancing by default on single node machines · b0dc2b9b
      Mel Gorman committed
      NUMA balancing is meant to be disabled by default on UMA machines but
      the check is using nr_node_ids (highest node) instead of
      num_online_nodes (online nodes).
      
      The consequences are that a UMA machine with a node ID of 1 or higher
      will enable NUMA balancing.  This will incur useless overhead due to
      minor faults with the impact depending on the workload.  This is the
      impact on the stats when running a kernel build on a single node machine
      whose node ID happened to be 1:
      
                                       vanilla     patched
        NUMA base PTE updates          5113158           0
        NUMA huge PMD updates              643           0
        NUMA page range updates        5442374           0
        NUMA hint faults               2109622           0
        NUMA hint local faults         2109622           0
        NUMA hint local percent            100         100
        NUMA pages migrated                  0           0
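
The difference between the two counters can be illustrated with a small userspace sketch. The mask type and helper names below are hypothetical stand-ins, not the kernel's nodemask API. The sketch assumes that nr_node_ids behaves like "highest node ID plus one", that num_online_nodes() counts online nodes, and that balancing is enabled when the checked value is greater than 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a node mask: bit n set means node n is online. */
typedef uint32_t node_mask_t;

/* Models nr_node_ids: highest node ID in the mask plus one (0 if empty). */
static unsigned int highest_node_id_plus_one(node_mask_t mask)
{
    unsigned int n = 0;

    while (mask) {
        n++;
        mask >>= 1;
    }
    return n;
}

/* Models num_online_nodes(): number of bits set in the mask. */
static unsigned int count_online_nodes(node_mask_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Check based on the highest node ID (the behaviour described as buggy). */
static bool balancing_enabled_old(node_mask_t mask)
{
    return highest_node_id_plus_one(mask) > 1;
}

/* Check based on the number of online nodes (the behaviour after the fix). */
static bool balancing_enabled_new(node_mask_t mask)
{
    return count_online_nodes(mask) > 1;
}
```

With a mask of 0x2 (a single node whose ID is 1), the old check reports balancing as enabled while the new check does not; for a real two-node mask such as 0x3 both agree.
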
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      CMA: page_isolation: check buddy before accessing it · 1ae7013d
      Hui Zhu committed
      I hit the following oops:
      
          Unable to handle kernel NULL pointer dereference at virtual address 0000082a
          pgd = cc970000
          [0000082a] *pgd=00000000
          Internal error: Oops: 5 [#1] PREEMPT SMP ARM
          PC is at get_pageblock_flags_group+0x5c/0xb0
          LR is at unset_migratetype_isolate+0x148/0x1b0
          pc : [<c00cc9a0>]    lr : [<c0109874>]    psr: 80000093
          sp : c7029d00  ip : 00000105  fp : c7029d1c
          r10: 00000001  r9 : 0000000a  r8 : 00000004
          r7 : 60000013  r6 : 000000a4  r5 : c0a357e4  r4 : 00000000
          r3 : 00000826  r2 : 00000002  r1 : 00000000  r0 : 0000003f
          Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
          Control: 10c5387d  Table: 2cb7006a  DAC: 00000015
          Backtrace:
              get_pageblock_flags_group+0x0/0xb0
              unset_migratetype_isolate+0x0/0x1b0
              undo_isolate_page_range+0x0/0xdc
              __alloc_contig_range+0x0/0x34c
              alloc_contig_range+0x0/0x18
      
      This happens because unset_migratetype_isolate(), when called to unset
      the isolation of a part of CMA memory, tries to access the buddy page to
      get its status:
      
      		if (order >= pageblock_order) {
      			page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
      			buddy_idx = __find_buddy_index(page_idx, order);
      			buddy = page + (buddy_idx - page_idx);
      
      			if (!is_migrate_isolate_page(buddy)) {
      
      But the start address of this part of CMA memory is very close to a
      region that is reserved at boot time (and so is not in the buddy system),
      so the buddy page may not be valid.  Add a check before accessing it.
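
As a rough userspace illustration of the idea (not the actual patch), the sketch below computes a buddy index by flipping bit `order` of the page index, which is how the buddy allocator pairs blocks, and refuses to touch the buddy unless it is known to be valid. The `valid` array and the `buddy_is_accessible()` helper are hypothetical; the kernel uses its own validity checks on page frame numbers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Buddy of a 2^order block: flip bit 'order' of the page index. */
static unsigned long find_buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/*
 * Hypothetical model: valid[pfn] tells whether pfn is managed by the buddy
 * system.  Memory reserved at boot time would be marked false here.
 */
static bool buddy_is_accessible(const bool *valid, size_t nr_pfns,
                                unsigned long pfn, unsigned int order)
{
    unsigned long buddy_pfn = find_buddy_index(pfn, order);

    if (buddy_pfn >= nr_pfns)   /* buddy lies outside the modelled range */
        return false;
    return valid[buddy_pfn];    /* only a valid buddy may be inspected */
}
```

A caller would inspect the buddy's migrate type only when `buddy_is_accessible()` returns true, and otherwise treat the block as having no mergeable buddy.
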
      
      [akpm@linux-foundation.org: use conventional code layout]
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Suggested-by: Laura Abbott <labbott@redhat.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      gfp: add __GFP_NOACCOUNT · 8f4fc071
      Vladimir Davydov committed
      Not all kmem allocations should be accounted to memcg.  The following
      patch gives an example where accounting a certain type of allocation to
      memcg can effectively result in a memory leak.  This patch adds the
      __GFP_NOACCOUNT flag which, if passed to kmalloc and friends, will force
      the allocation to go through the root cgroup.  It will be used by the next
      patch.
      
      Note that, with kmemleak enabled, each kmalloc implies yet another
      allocation from the kmemleak_object cache, so we add __GFP_NOACCOUNT to
      gfp_kmemleak_mask.
      
      Alternatively, we could introduce a per kmem cache flag disabling
      accounting for all allocations of a particular kind, but (a) we would not
      be able to bypass accounting for kmalloc then and (b) a kmem cache with
      this flag set could not be merged with a kmem cache without this flag,
      which would increase the number of global caches and therefore
      fragmentation even if the memory cgroup controller is not used.
      
      Despite its generic name, __GFP_NOACCOUNT currently disables accounting
      only for kmem allocations, while user page allocations are always charged.
      To catch abuse of this flag, a warning is issued on an attempt to pass it
      to mem_cgroup_try_charge.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 May, 2015 1 commit
    • A
      mm/net: Rename and move page fragment handling from net/ to mm/ · b63ae8ca
      Alexander Duyck committed
      This change moves the __alloc_page_frag functionality out of the networking
      stack and into the page allocation portion of mm.  The idea is to help make
      this maintainable by placing it with other page allocation functions.
      
      Since we are moving it from skbuff.c to page_alloc.c I have also renamed
      the basic defines and structure from netdev_alloc_cache to page_frag_cache
      to reflect that this is now part of a different kernel subsystem.
      
      I have also added a simple __free_page_frag function which can handle
      freeing the frags based on the skb->head pointer.  The model for this is
      based off of __free_pages since we don't actually need to deal with all of
      the cases that put_page handles.  I incorporated the virt_to_head_page call
      and compound_order into the function as it actually allows for a significant
      size reduction by reducing code duplication.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 06 May, 2015 4 commits
  4. 24 April, 2015 1 commit
    • T
      writeback: use |1 instead of +1 to protect against div by zero · 464d1387
      Tejun Heo committed
      mm/page-writeback.c has several places where 1 is added to the divisor
      to prevent division by zero exceptions; however, if the original
      divisor is equivalent to -1, adding 1 leads to division by zero.
      
      There are three places where +1 is used for this purpose - one in
      pos_ratio_polynom() and two in bdi_position_ratio().  The second one
      in bdi_position_ratio() actually triggered div-by-zero oops on a
      machine running a 3.10 kernel.  The divisor is
      
        x_intercept - bdi_setpoint + 1 == span + 1
      
      span is confirmed to be (u32)-1.  It isn't clear how it ended up that
      way, but it could be from the write bandwidth calculation underflow fixed by
      c72efb65 ("writeback: fix possible underflow in write bandwidth
      calculation").
      
      At any rate, +1 isn't a proper protection against div-by-zero.  This
      patch converts all +1 protections to |1.  Note that
      bdi_update_dirty_ratelimit() was already using |1 before this patch.
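
A minimal sketch of why |1 is the safer guard, using hypothetical helper names and a 32-bit unsigned divisor as in the span example above: adding 1 wraps (u32)-1 around to zero, whereas OR-ing in the low bit can never produce zero.

```c
#include <stdint.h>

/* "+1" guard: wraps to 0 when d is UINT32_MAX, i.e. (u32)-1. */
static uint32_t guard_plus_one(uint32_t d)
{
    return d + 1;
}

/* "|1" guard: the lowest bit is always set, so the result is never 0. */
static uint32_t guard_or_one(uint32_t d)
{
    return d | 1;
}
```

The trade-off is a tiny loss of precision: |1 leaves odd divisors unchanged and bumps even divisors by one, which is acceptable when the guard exists only to avoid a zero divisor.
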
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. 16 April, 2015 31 commits
    • S
      zsmalloc: remove extra cond_resched() in __zs_compact · 160a117f
      Sergey Senozhatsky committed
      Do not perform cond_resched() before the busy compaction loop in
      __zs_compact(), because this loop does it when needed.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      zsmalloc: fix fatal corruption due to wrong size class selection · 81da9b13
      Heesub Shin committed
      There is no point in overriding the size class below.  It causes fatal
      corruption of the next chunk in the 3264-byte size class, which is the
      last size class that is not huge.

      For example, if the requested size was exactly 3264 bytes, current
      zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
      not 4096.  User access to this chunk may overwrite the head of the next
      adjacent chunk.
      
      Here is the panic log captured when freelist was corrupted due to this:
      
          Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
          Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
          Modules linked in:
          exynos-snapshot: core register saved(CPU:5)
          CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
          exynos-snapshot: context saved(CPU:5)
          exynos-snapshot: item - log_kevents is disabled
          CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
          task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
          PC is at obj_idx_to_offset+0x0/0x1c
          LR is at obj_malloc+0x44/0xe8
          pc : [<ffffffc00030659c>] lr : [<ffffffc000306604>] pstate: a0000045
          sp : ffffffc0b71eb790
          x29: ffffffc0b71eb790 x28: ffffffc00204c000
          x27: 000000000001d96f x26: 0000000000000000
          x25: ffffffc098cc3500 x24: ffffffc0a13f2810
          x23: ffffffc098cc3501 x22: ffffffc0a13f2800
          x21: 000011e1a02006e3 x20: ffffffc0a13f2800
          x19: ffffffbc02a7e000 x18: 0000000000000000
          x17: 0000000000000000 x16: 0000000000000feb
          x15: 0000000000000000 x14: 00000000a01003e3
          x13: 0000000000000020 x12: fffffffffffffff0
          x11: ffffffc08b264000 x10: 00000000e3a01004
          x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
          x7 : ffffffc000307d24 x6 : 0000000000000000
          x5 : 0000000000000038 x4 : 000000000000011e
          x3 : ffffffbc00003e90 x2 : 0000000000000cc0
          x1 : 00000000d0100371 x0 : ffffffbc00003e90
      Reported-by: Sooyong Suk <s.suk@samsung.com>
      Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
      Tested-by: Sooyong Suk <s.suk@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: remove unnecessary insertion/removal of zspage in compaction · 839373e6
      Minchan Kim committed
      In putback_zspage, we don't need to insert a zspage into the list of
      zspages in size_class again just to fix its fullness group.  We can fix it
      directly without reinsertion, which saves some instructions.
      Reported-by: Heesub Shin <heesub.shin@samsung.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Juneho Choi <juno.choi@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      zsmalloc: micro-optimize zs_object_copy() · 495819ea
      Sergey Senozhatsky committed
      A micro-optimization.  Avoid additional branching and reduce (a bit)
      register pressure (e.g., s_off += size; d_off += size; may be calculated
      twice: first for the >= PAGE_SIZE check and later for the offset update in
      the "else" clause).
      
      scripts/bloat-o-meter shows some improvement
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
      function                          old     new   delta
      zs_object_copy                    550     540     -10
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      zsmalloc: remove synchronize_rcu from zs_compact() · 1ec7cfb1
      Sergey Senozhatsky committed
      Do not synchronize RCU in zs_compact().  Neither zsmalloc nor zram uses
      RCU.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Y
    • M
      zsmalloc: zsmalloc documentation · d02be50d
      Minchan Kim committed
      Create a zsmalloc document which explains the design concept and stat
      information.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: add fullness into stat · 248ca1b0
      Minchan Kim committed
      While investigating compaction, the fullness information of each class
      helps show how well compaction works.  With it, we can see more clearly
      how well compaction works on each size class.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: record handle in page->private for huge object · 7b60a685
      Minchan Kim committed
      We store the handle in the header of each allocated object, which
      increases the size of each object by sizeof(unsigned long).

      If zram stores 4096 bytes to zsmalloc (i.e., bad compression), zsmalloc
      needs a 4104B class to add the handle.

      However, the 4104B class has 1-pages_per_zspage, so the space wasted by
      internal fragmentation is 8192 - 4104, which is terrible.

      So this patch records the handle in page->private for such huge objects
      (i.e., pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of in the
      header of each object, so we can use the 4096B class, not the 4104B class.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: adjust ZS_ALMOST_FULL · d3d07c92
      Minchan Kim committed
      Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if fewer than 1/4
      of its objects are used (i.e., fullness_threshold_frac).  This can result
      in loose packing since zsmalloc migrates only ZS_ALMOST_EMPTY zspages out.

      This patch changes the rule so that zsmalloc marks a zspage as
      ZS_ALMOST_FULL only if more than 3/4 of its objects are used, which allows
      tighter packing.
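
The rule change can be sketched as two classification functions. This is an illustrative model only: the enum, the function names, the threshold constant of 4 and the exact boundary comparison (`<=`) are assumptions chosen to match the 1/4 and 3/4 fractions in the description, not a copy of the zsmalloc source.

```c
/* Hypothetical model of the zspage fullness groups. */
enum fullness_group { ZS_EMPTY, ZS_ALMOST_EMPTY, ZS_ALMOST_FULL, ZS_FULL };

/* Assumed denominator for the 1/4 and 3/4 thresholds. */
#define FULLNESS_THRESHOLD_FRAC 4

/* Old rule: only zspages with at most 1/4 of objects in use are ALMOST_EMPTY. */
static enum fullness_group classify_old(unsigned int inuse, unsigned int max_objects)
{
    if (inuse == 0)
        return ZS_EMPTY;
    if (inuse == max_objects)
        return ZS_FULL;
    if (inuse <= max_objects / FULLNESS_THRESHOLD_FRAC)
        return ZS_ALMOST_EMPTY;
    return ZS_ALMOST_FULL;
}

/* New rule: a zspage is ALMOST_FULL only above 3/4 usage. */
static enum fullness_group classify_new(unsigned int inuse, unsigned int max_objects)
{
    if (inuse == 0)
        return ZS_EMPTY;
    if (inuse == max_objects)
        return ZS_FULL;
    if (inuse <= 3 * max_objects / FULLNESS_THRESHOLD_FRAC)
        return ZS_ALMOST_EMPTY;
    return ZS_ALMOST_FULL;
}
```

For a zspage holding 8 objects with 3 in use, the old rule yields ZS_ALMOST_FULL (so it is never a migration source), while the new rule yields ZS_ALMOST_EMPTY and makes it eligible to be migrated out.
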
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: support compaction · 312fcae2
      Minchan Kim committed
      This patch provides core functions for migration of zsmalloc.  The
      migration policy is simple, as follows.
      
      for each size class {
              while {
                      src_page = get zs_page from ZS_ALMOST_EMPTY
                      if (!src_page)
                              break;
                      dst_page = get zs_page from ZS_ALMOST_FULL
                      if (!dst_page)
                              dst_page = get zs_page from ZS_ALMOST_EMPTY
                      if (!dst_page)
                              break;
                      migrate(from src_page, to dst_page);
              }
      }
      
      For migration, we need to identify which objects in a zspage are allocated
      so that we can migrate them out.  We could find them by iterating over the
      freed objects in a zspage, because the first_page of a zspage keeps free
      objects in a singly-linked list, but that is not efficient.  Instead, this
      patch adds a tag (i.e., OBJ_ALLOCATED_TAG) in the header of each object
      (i.e., handle) so we can easily check whether the object is allocated.
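
The tagging idea can be sketched in userspace as follows. The sketch assumes that a handle is an address aligned to at least two bytes, so its lowest bit is always zero and can carry the tag; the tag value of 1 and the helper names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tag value: the lowest bit of an aligned handle is otherwise unused. */
#define OBJ_ALLOCATED_TAG ((uintptr_t)1)

/* Build the header word stored in front of an allocated object. */
static uintptr_t header_mark_allocated(uintptr_t handle)
{
    return handle | OBJ_ALLOCATED_TAG;
}

/* An object is allocated if its header carries the tag bit. */
static bool header_is_allocated(uintptr_t header)
{
    return (header & OBJ_ALLOCATED_TAG) != 0;
}

/* Strip the tag to recover the original handle. */
static uintptr_t header_to_handle(uintptr_t header)
{
    return header & ~OBJ_ALLOCATED_TAG;
}
```

A migration pass could then walk the objects of a zspage by index and test each header in constant time, instead of walking the free list to infer which slots are in use.
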
      
      This patch adds another status bit in the handle to synchronize between
      user access through zs_map_object and migration.  During migration, we
      cannot move objects that users are using, to keep data coherent between
      the old object and the new object.
      
      [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: factor out obj_[malloc|free] · c7806261
      Minchan Kim committed
      In a later patch, migration needs some parts of the functions zs_malloc
      and zs_free, so this patch factors them out.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      zsmalloc: decouple handle and object · 2e40e163
      Minchan Kim committed
      Recently, we started to use zram heavily and some issues
      popped up.
      
      1) external fragmentation
      
      I got a report from Juneho Choi that fork failed although there were
      plenty of free pages in the system.  His investigation revealed zram is
      one of the culprits causing heavy fragmentation, so there was no more
      contiguous 16K page for the pgd to fork on ARM.
      
      2) non-movable pages
      
      The other problem with zram now is that users inherently want to use zram
      as swap in small-memory systems, so they use zRAM with CMA to use memory
      efficiently.  Unfortunately, it doesn't work well because zRAM cannot use
      CMA's movable pages unless it supports compaction.  I got several reports
      that OOM happened with zram although there was lots of swap space and free
      space in the CMA area.
      
      3) internal fragmentation
      
      zRAM has started to support a memory limitation feature to limit memory
      usage, and I sent a patchset (https://lkml.org/lkml/2014/9/21/148) for the
      VM to be harmonized with zram-swap, to stop anonymous page reclaim once
      zram has consumed memory up to the limit even though there is free space on
      the swap.  One problem with that direction is that zram has no way to know
      about holes in the memory space zsmalloc allocated, caused by internal
      fragmentation, so zram would regard swap as full although there is free
      space in zsmalloc.  To solve the issue, zram wants to trigger compaction
      of zsmalloc before it decides whether it is full or not.
      
      This patchset is the first step to address the above issues.  For that, it
      adds an indirect layer between handle and object location and supports
      manual compaction to solve the third problem first of all.

      After this patchset is merged, the next step is to make the VM aware of
      zsmalloc compaction so that generic compaction will move zsmalloced pages
      automatically at runtime.
      
      In my artificial experiment (i.e., high-compression-ratio data with heavy
      swap in/out on 8G zram-swap), the data is as follows:
      
      Before =
      zram allocated object :      60212066 bytes
      zram total used:     140103680 bytes
      ratio:         42.98 percent
      MemFree:          840192 kB
      
      Compaction
      
      After =
      frag ratio after compaction
      zram allocated object :      60212066 bytes
      zram total used:      76185600 bytes
      ratio:         79.03 percent
      MemFree:          901932 kB
      
      Juneho reported the results below from his real platform with a short
      aging period.  So I think the benefit would be bigger on a real system
      aged for a long time.
      
      - frag_ratio increased 3% (ie, higher is better)
      - memfree increased about 6MB
      - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased
      
      frag ratio after swap fragment
      used :        156677 kbytes
      total:        166092 kbytes
      frag_ratio :  94
      meminfo before compaction
      MemFree:           83724 kB
      Node 0, zone   Normal  13642   1364     57     10     61     17      9      5      4      0      0
      Node 0, zone  HighMem    425     29      1      0      0      0      0      0      0      0      0
      
      num_migrated :  23630
      compaction done
      
      frag ratio after compaction
      used :        156673 kbytes
      total:        160564 kbytes
      frag_ratio :  97
      meminfo after compaction
      MemFree:           89060 kB
      Node 0, zone   Normal  14076   1544     67     14     61     17      9      5      4      0      0
      Node 0, zone  HighMem    863     50      1      0      0      0      0      0      0      0      0
      
      This patchset adds more logic (about 480 lines) to zsmalloc, but when I
      tested a heavy swapin/out program, the regression in swapin/out speed was
      marginal because most of the overhead was caused by compress/decompress and
      other MM reclaim work.
      
      This patch (of 7):
      
      Currently, the zsmalloc handle encodes the object's location directly,
      which makes supporting migration hard.

      This patch decouples handle and object by adding an indirect layer.  For
      that, it allocates the handle dynamically and returns it to the user.  The
      handle is the address allocated by the slab allocator, so it's unique, and
      we can keep the object's location in the memory space allocated for the
      handle.

      With it, we can change the object's position without changing the handle
      itself.
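
The indirection can be modelled in a few lines of userspace C. This is a sketch under stated assumptions: malloc() stands in for the slab allocation, `obj_location_t` is a hypothetical encoding of the object's location, and all function names are made up for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical encoding of an object's location (e.g. page and index). */
typedef unsigned long obj_location_t;

/*
 * Allocate a handle: a small slot whose address is handed to the user and
 * whose contents record where the object currently lives.  Returns 0 on
 * allocation failure.
 */
static uintptr_t handle_alloc(obj_location_t loc)
{
    obj_location_t *slot = malloc(sizeof(*slot));

    if (!slot)
        return 0;
    *slot = loc;
    return (uintptr_t)slot;
}

/* Look up the object's current location through the handle. */
static obj_location_t handle_to_location(uintptr_t handle)
{
    return *(obj_location_t *)handle;
}

/* Migration moves the object and updates only the slot contents. */
static void handle_relocate(uintptr_t handle, obj_location_t new_loc)
{
    *(obj_location_t *)handle = new_loc;
}

/* Release the slot once the object itself has been freed. */
static void handle_free(uintptr_t handle)
{
    free((void *)handle);
}
```

Because the user only ever holds the slot's address, `handle_relocate()` can change where the object lives without invalidating any handle the user has stored.
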
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/compaction.c: fix "suitable_migration_target() unused" warning · 018e9a49
      Andrew Morton committed
      mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]
      Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • B
      mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP · dd906184
      Boaz Harrosh committed
      This will allow filesystems that use VM_PFNMAP | VM_MIXEDMAP (no page
      structs) to get notified when an access is a write to a read-only PFN.

      This can happen if we mmap() a file, then first mmap-read from it to
      page-in a read-only PFN, then mmap-write to the same page.
      
      We need this functionality to fix a DAX bug, where in the scenario above
      we fail to set ctime/mtime though we modified the file.  An xfstest is
      attached to this patchset that shows the failure and the fix.  (A DAX
      patch will follow)
      
      This functionality is extra important for us, because upon dirtying of a
      pmem page we also want to RDMA the page to a remote cluster node.
      
      We define a new pfn_mkwrite and do not reuse page_mkwrite because
        1 - The name ;-)
        2 - But mainly because it would take a very long and tedious
            audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
            users to make sure they do not now CRASH. For example, the current
            DAX code (which this is for) would crash.
            If we wanted to reuse page_mkwrite, we would first need to
            patch all users so as to not-crash-on-no-page, then enable this
            patch. But even if I did that I would not sleep so well at night.
            Adding a new vector is the safest thing to do, and is not that
            expensive: an extra pointer in a static function vector per driver.
            Also the new vector is better for performance, because otherwise we
            would call all current kernel vectors just to:
              check-has-no-page-do-nothing and return.
      
      No need to call it from do_shared_fault because do_wp_page is called to
      change pte permissions anyway.
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/memory: also print a_ops->readpage in print_bad_pte() · 2682582a
      Konstantin Khlebnikov committed
      A lot of filesystems use generic_file_mmap() and filemap_fault(), so
      f_op->mmap and vm_ops->fault aren't enough to identify the filesystem.

      This prints the file name, vm_ops->fault, f_op->mmap and a_ops->readpage
      (which is almost always implemented and filesystem-specific).
      
      Example:
      
      [   23.676410] BUG: Bad page map in process sh  pte:1b7e6025 pmd:19bbd067
      [   23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
      [   23.677481] flags: 0x10000000000000c(referenced|uptodate)
      [   23.677896] page dumped because: bad pte
      [   23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma:          (null) mapping:ffff8800196426c0 index:97
      [   23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
      
      [akpm@linux-foundation.org: use pr_alert, per Kirill]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/mempool.c: kasan: poison mempool elements · 92393615
      Andrey Ryabinin committed
      Mempools keep allocated objects in reserve for situations when an ordinary
      allocation may not be possible to satisfy.  These objects shouldn't be
      accessed before they leave the pool.

      This patch poisons elements when they get into the pool and unpoisons them
      when they leave it.  This will let KASan detect use-after-free of
      mempool's elements.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Chernenkov <drcheren@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      mm/cma_debug.c: remove blank lines before DEFINE_SIMPLE_ATTRIBUTE() · bda6d330
      Andrew Morton committed
      Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
      to the immediately preceding function.
      
      Cc: Dmitry Safonov <d.safonov@partner.samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Stefan Strogin <stefan.strogin@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Vyacheslav Tyrtov <v.tyrtov@samsung.com>
      Cc: Aleksei Mateosian <a.mateosian@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda6d330
    • D
      mm: cma: add functions to get region pages counters · 2e32b947
      Dmitry Safonov authored
      Add two functions that provide an interface to compute the used size and
      the size of the biggest free chunk in a CMA region.  Expose that
      information in debugfs.
      
      [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
      [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
      Signed-off-by: Dmitry Safonov <d.safonov@partner.samsung.com>
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Vyacheslav Tyrtov <v.tyrtov@samsung.com>
      Cc: Aleksei Mateosian <a.mateosian@samsung.com>
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e32b947
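A minimal userspace sketch of the two counters described in the commit above: the used size and the biggest free chunk of a region. It assumes the region is tracked by an allocation bitmap in which a set bit marks a used chunk; that layout and every `sketch_` name are illustrative assumptions, not the code added by the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Number of bits held by one bitmap word. */
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return 1 if the given bit is set (chunk in use), 0 otherwise. */
static int sketch_test_bit(const unsigned long *bitmap, size_t bit)
{
    unsigned long word = bitmap[bit / SKETCH_BITS_PER_WORD];

    return (int)((word >> (bit % SKETCH_BITS_PER_WORD)) & 1UL);
}

/* Used size: count the set bits among the first nbits bits. */
static size_t sketch_used_bits(const unsigned long *bitmap, size_t nbits)
{
    size_t used = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++)
        used += (size_t)sketch_test_bit(bitmap, i);
    return used;
}

/* Biggest free chunk: length of the longest run of clear bits. */
static size_t sketch_max_free_run(const unsigned long *bitmap, size_t nbits)
{
    size_t best = 0;
    size_t run = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++) {
        if (sketch_test_bit(bitmap, i)) {
            run = 0;            /* a used chunk ends the current free run */
        } else {
            run++;
            if (run > best)
                best = run;
        }
    }
    return best;
}
```

Both helpers walk the bitmap once, so reading either value costs time linear in the region size, which is acceptable for a debugging interface that is read only occasionally.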
    • K
      thp: cleanup khugepaged startup · 79553da2
      Kirill A. Shutemov authored
      A few trivial cleanups:
      
       - no need to call set_recommended_min_free_kbytes() from
         late_initcall() -- start_khugepaged() calls it;
      
       - no need to call set_recommended_min_free_kbytes() from
         start_khugepaged() if khugepaged is not started;
      
       - there isn't much point in running start_khugepaged() if we've just
         set transparent_hugepage_flags to zero;
      
       - start_khugepaged() is misnamed -- it is also used to stop the thread.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79553da2
    • K
      mm: uninline and cleanup page-mapping related helpers · e39155ea
      Kirill A. Shutemov authored
      The most-used page->mapping helper -- page_mapping() -- has already been
      uninlined.  Let's also uninline page_rmapping() and page_anon_vma().
      Depending on configuration, this saves around 400 bytes of text:
      
         text	   data	    bss	    dec	    hex	filename
       660318	  99254	 410000	1169572	 11d8a4	mm/built-in.o-before
       659854	  99254	 410000	1169108	 11d6d4	mm/built-in.o
      
      I also tried to make the code a bit cleaner.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e39155ea
    • S
      mm: cma: add trace events for CMA allocations and freeings · 99e8ea6c
      Stefan Strogin authored
      Add trace events for cma_alloc() and cma_release().
      
      The cma_alloc tracepoint is used for both successful and failed
      allocations; on allocation failure, pfn=-1UL is stored and printed.
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mpn@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99e8ea6c
    • A
      mm/memblock.c: add debug output for memblock_add() · 6a4055bc
      Alexander Kuleshov authored
      memblock_reserve() calls memblock_reserve_region() which prints debugging
      information if 'memblock=debug' was passed on the command line.  This
      patch adds the same behaviour for memblock_add().
      
      [akpm@linux-foundation.org: s/memblock_memory/memblock_add/ in message]
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Emil Medve <Emilian.Medve@freescale.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a4055bc
    • N
      mm: hugetlb: cleanup using page_huge_active() · 7e1f049e
      Naoya Horiguchi authored
      Now that we have easy access to hugepages' activeness, the existing
      helpers that get this information can be cleaned up.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e1f049e
    • N
      mm: hugetlb: introduce page_huge_active · bcc54222
      Naoya Horiguchi authored
      Nothing prevents isolate_huge_page() from being called concurrently on
      the same hugepage, which can leave the victim hugepage in an invalid
      state and result in a BUG_ON().
      
      The root problem is that struct page carries no easily accessible
      information about hugepages' activeness.  Note that hugepages'
      activeness means just being linked to hstate->hugepage_activelist, which
      is not the same as normal pages' activeness represented by the
      PageActive flag.
      
      Normal pages are isolated by isolate_lru_page(), which prechecks PageLRU
      before isolation, so let's do the same for hugetlb with a new
      page_huge_active().
      
      set/clear_page_huge_active() should be called within hugetlb_lock.
      hugetlb_cow() and hugetlb_no_page() don't do this, which is justified
      because in these functions set_page_huge_active() is called right after
      the hugepage is allocated and no other thread tries to isolate it.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
      [fengguang.wu@intel.com: set_page_huge_active() can be static]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcc54222
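The precheck described in the commit above can be modelled in a few lines of userspace C. This is only a sketch of the idea: the `sketch_` types and functions are hypothetical, the active list and reference counting are reduced to a single flag, and the locking that the commit message attributes to hugetlb_lock is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hugepage; "active" models list membership. */
struct sketch_hugepage {
    bool active;
};

/* Analogue of the new helper: report whether the page is currently active. */
static bool sketch_page_huge_active(const struct sketch_hugepage *page)
{
    return page->active;
}

/*
 * Analogue of the isolation path: check activeness first, and refuse to
 * isolate a page that is not active (for example because another caller
 * has already isolated it).  In the kernel the check and the state change
 * would have to happen under a lock; this single-threaded model omits it.
 */
static bool sketch_isolate_huge_page(struct sketch_hugepage *page)
{
    assert(page != NULL);
    if (!sketch_page_huge_active(page))
        return false;
    page->active = false;   /* the page leaves the active list */
    return true;
}
```

With the precheck in place, a second isolation attempt on the same page fails cleanly instead of operating on a page that is no longer on the active list.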
    • N
      mm: don't call __page_cache_release for hugetlb · 822fc613
      Naoya Horiguchi authored
      __put_compound_page() calls __page_cache_release() to do some freeing
      work, but that is obviously meant for THPs, not for hugetlb.  This does
      no harm in practice because PageLRU is always cleared and
      page->mem_cgroup is always NULL for hugetlb.  But it's not correct and
      has potential risks, so let's make it conditional.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      822fc613
    • R
      mm/mmap.c: use while instead of if+goto · 9fcd1457
      Rasmus Villemoes authored
      The creators of the C language gave us the while keyword. Let's use
      that instead of synthesizing it from if+goto.
      
      Made possible by 6597d783 ("mm/mmap.c: replace find_vma_prepare()
      with clearer find_vma_links()").
      
      [akpm@linux-foundation.org: fix 80-col overflows]
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fcd1457
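The transformation named in the commit above can be illustrated on a generic binary search tree walk. The tree, the node type, and the `sketch_` functions are hypothetical and are not taken from mm/mmap.c; the sketch only shows that a loop synthesized from if+goto and a `while` loop with the same condition behave identically.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical binary search tree node. */
struct sketch_node {
    int key;
    struct sketch_node *left;
    struct sketch_node *right;
};

/* Loop synthesized from if+goto: jump back while the condition holds. */
static const struct sketch_node *
sketch_find_goto(const struct sketch_node *node, int key)
{
again:
    if (node != NULL && node->key != key) {
        node = key < node->key ? node->left : node->right;
        goto again;
    }
    return node;
}

/* The same walk written with the while keyword. */
static const struct sketch_node *
sketch_find_while(const struct sketch_node *node, int key)
{
    while (node != NULL && node->key != key)
        node = key < node->key ? node->left : node->right;
    return node;
}

/* Run both forms and assert that they agree before returning the result. */
static const struct sketch_node *
sketch_find_checked(const struct sketch_node *root, int key)
{
    const struct sketch_node *found = sketch_find_goto(root, key);

    assert(found == sketch_find_while(root, key));
    return found;
}
```

The `while` form states the loop condition once, at the top, and removes the label, which is the readability gain the commit message argues for.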
    • K
      thp: do not adjust zone water marks if khugepaged is not started · ae7efa50
      Kirill A. Shutemov authored
      set_recommended_min_free_kbytes() adjusts zone water marks to be suitable
      for khugepaged.  We avoid doing this if khugepaged is disabled, but we
      don't catch the case where khugepaged has failed to start.
      
      Let's address this by checking khugepaged_thread instead of
      khugepaged_enabled() in set_recommended_min_free_kbytes().
      khugepaged_thread is NULL if the kernel thread is stopped or failed to
      start.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae7efa50
    • K
      thp: handle errors in hugepage_init() properly · 65ebb64f
      Kirill A. Shutemov authored
      Error handling is missing in a few cases in hugepage_init().  Let's fix
      that.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65ebb64f
    • D
      mm, mempool: poison elements backed by slab allocator · bdfedb76
      David Rientjes authored
      Mempools keep elements in a reserved pool for contexts in which allocation
      may not be possible.  When an element is allocated from the reserved pool,
      its memory contents are the same as when it was added to the reserved pool.
      
      Because of this, elements lack any free poisoning to detect use-after-free
      errors.
      
      This patch adds free poisoning for elements backed by the slab allocator.
      This is possible because the mempool layer knows the object size of each
      element.
      
      When an element is added to the reserved pool, it is poisoned with
      POISON_FREE.  When it is removed from the reserved pool, the contents are
      checked for POISON_FREE.  If there is a mismatch, a warning is emitted to
      the kernel log.
      
      This is only effective for configs with CONFIG_DEBUG_SLAB or
      CONFIG_SLUB_DEBUG_ON.
      
      [fabio.estevam@freescale.com: use '%zu' for printing 'size_t' variable]
      [arnd@arndb.de: add missing include]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdfedb76
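The poison-on-add, check-on-remove scheme from the commit above can be sketched in userspace C. This is a simplified model under stated assumptions: `SKETCH_POISON_FREE` stands in for the kernel's POISON_FREE byte, every byte of the element is filled with it, and the `sketch_` functions are hypothetical rather than the helpers in mm/mempool.c. Where the patch emits a warning to the kernel log, this sketch simply returns false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the kernel's POISON_FREE fill byte. */
#define SKETCH_POISON_FREE 0x6b

/* Called when an element is added to the reserved pool. */
static void sketch_poison_element(void *element, size_t size)
{
    assert(element != NULL);
    memset(element, SKETCH_POISON_FREE, size);
}

/*
 * Called when an element is removed from the reserved pool.  Returns true
 * if every byte still holds the poison value, false if something wrote to
 * the element while it sat in the pool (a use-after-free).
 */
static bool sketch_check_element(const void *element, size_t size)
{
    const unsigned char *bytes = element;

    assert(element != NULL);
    for (size_t i = 0; i < size; i++) {
        if (bytes[i] != SKETCH_POISON_FREE)
            return false;
    }
    return true;
}
```

The check needs the element size, which is why the commit message notes that the scheme works for slab-backed elements: the mempool layer knows the object size of each one.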
    • D
      mm, mempool: disallow mempools based on slab caches with constructors · e244c9e6
      David Rientjes authored
      All occurrences of mempools based on slab caches with object constructors
      have been removed from the tree, so disallow creating them.
      
      Dereferencing mem->ctor is only possible in mm/mempool.c unless
      include/linux/mempool.h were to include mm/slab.h.  So simply note the
      restriction, just like the comment restricting usage of __GFP_ZERO, and
      warn on kernels with CONFIG_DEBUG_VM if such a mempool is allocated from.
      
      We don't want to incur this check on every element allocation, so use
      VM_BUG_ON().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e244c9e6