1. 16 Apr, 2015 40 commits
    • zsmalloc: fix fatal corruption due to wrong size class selection · 81da9b13
      Heesub Shin committed
      There is no point in overriding the size class below.  It causes fatal
      corruption on the next chunk on the 3264-bytes size class, which is the
      last size class that is not huge.
      
      For example, if the requested size was exactly 3264 bytes, current
      zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
      not 4096.  User access to this chunk may overwrite head of the next
      adjacent chunk.
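The overflow described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: it assumes an 8-byte handle stored in a header inside every non-huge chunk (64-bit), assumes the huge class keeps the handle outside the chunk, and takes the two class sizes (3264 and 4096) from the message. `HANDLE_SIZE`, `bytes_written` and `overflow` are hypothetical names.

```python
# Toy model of the overflow; every name here is an illustrative assumption.
HANDLE_SIZE = 8        # assumed sizeof(unsigned long) on a 64-bit kernel
LAST_NON_HUGE = 3264   # last non-huge size class, per the commit message
HUGE = 4096            # huge class: one object per page


def bytes_written(requested: int, chunk: int) -> int:
    """Bytes the object occupies inside a chunk of the given class size.

    Non-huge chunks carry the handle in an in-chunk header; the huge
    class is assumed to keep the handle outside the chunk.
    """
    header = 0 if chunk == HUGE else HANDLE_SIZE
    return header + requested


def overflow(requested: int, chunk: int) -> int:
    """Number of bytes that spill past the end of the chunk (0 if it fits)."""
    return max(0, bytes_written(requested, chunk) - chunk)


# Wrong selection: a 3264-byte request is served from the 3264-byte class.
print(overflow(3264, LAST_NON_HUGE))
# Correct selection: the same request is served from the 4096-byte class.
print(overflow(3264, HUGE))
```

In this model the wrongly selected class leaves the header plus payload one handle larger than the chunk, which is the kind of spill into the next chunk that the message describes.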
      
      Here is the panic log captured when freelist was corrupted due to this:
      
          Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
          Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
          Modules linked in:
          exynos-snapshot: core register saved(CPU:5)
          CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
          exynos-snapshot: context saved(CPU:5)
          exynos-snapshot: item - log_kevents is disabled
          CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
          task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
          PC is at obj_idx_to_offset+0x0/0x1c
          LR is at obj_malloc+0x44/0xe8
          pc : [<ffffffc00030659c>] lr : [<ffffffc000306604>] pstate: a0000045
          sp : ffffffc0b71eb790
          x29: ffffffc0b71eb790 x28: ffffffc00204c000
          x27: 000000000001d96f x26: 0000000000000000
          x25: ffffffc098cc3500 x24: ffffffc0a13f2810
          x23: ffffffc098cc3501 x22: ffffffc0a13f2800
          x21: 000011e1a02006e3 x20: ffffffc0a13f2800
          x19: ffffffbc02a7e000 x18: 0000000000000000
          x17: 0000000000000000 x16: 0000000000000feb
          x15: 0000000000000000 x14: 00000000a01003e3
          x13: 0000000000000020 x12: fffffffffffffff0
          x11: ffffffc08b264000 x10: 00000000e3a01004
          x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
          x7 : ffffffc000307d24 x6 : 0000000000000000
          x5 : 0000000000000038 x4 : 000000000000011e
          x3 : ffffffbc00003e90 x2 : 0000000000000cc0
          x1 : 00000000d0100371 x0 : ffffffbc00003e90
      Reported-by: Sooyong Suk <s.suk@samsung.com>
      Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
      Tested-by: Sooyong Suk <s.suk@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: remove unnecessary insertion/removal of zspage in compaction · 839373e6
      Minchan Kim committed
      In putback_zspage, we don't need to insert a zspage into the list of
      zspages in size_class again just to fix its fullness group.  We can do
      that directly, without reinsertion, and so save some instructions.
      Reported-by: Heesub Shin <heesub.shin@samsung.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Juneho Choi <juno.choi@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: micro-optimize zs_object_copy() · 495819ea
      Sergey Senozhatsky committed
      A micro-optimization.  Avoid additional branching and reduce (a bit)
      register pressure (e.g.  s_off += size; d_off += size; may be calculated
      twice: first for the >= PAGE_SIZE check and later for the offset update
      in the "else" clause).
      
      scripts/bloat-o-meter shows some improvement
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
      function                          old     new   delta
      zs_object_copy                    550     540     -10
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: remove synchronize_rcu from zs_compact() · 1ec7cfb1
      Sergey Senozhatsky committed
      Do not synchronize RCU in zs_compact().  Neither zsmalloc nor
      zram uses RCU.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: deprecate zram attrs sysfs nodes · 8f7d282c
      Sergey Senozhatsky committed
      Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and
      deprecated attributes there.  The patch also adds additional information
      to zram documentation and describes the basic strategy:
      
      - the existing RW nodes will be downgraded to WO nodes (in 4.11)
      - deprecated RO sysfs nodes will eventually be removed (in 4.11)
      
      Users will be additionally notified about deprecated attr usage by
      pr_warn_once() (added to every deprecated attr _show()), as suggested by
      Minchan Kim.
      
      User space is advised to use zram<id>/stat, zram<id>/io_stat and
      zram<id>/mm_stat files.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: export new 'mm_stat' sysfs attrs · 4f2109f6
      Sergey Senozhatsky committed
      Per-device `zram<id>/mm_stat' file provides mm statistics of a particular
      zram device in a format similar to block layer statistics.  The file
      consists of a single line and represents the following stats (separated by
      whitespace):
      
              orig_data_size
              compr_data_size
              mem_used_total
              mem_limit
              mem_used_max
              zero_pages
              num_migrated
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: export new 'io_stat' sysfs attrs · 2f6a3bed
      Sergey Senozhatsky committed
      Per-device `zram<id>/io_stat' file provides accumulated I/O statistics of
      particular zram device in a format similar to block layer statistics.  The
      file consists of a single line and represents the following stats
      (separated by whitespace):
      
              failed_reads
              failed_writes
              invalid_io
              notify_free
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: describe device attrs in documentation · 77ba015f
      Sergey Senozhatsky committed
      Briefly describe exported device stat attrs in zram documentation.  We
      will eventually get rid of per-stat sysfs nodes and, thus, clean up
      Documentation/ABI/testing/sysfs-block-zram file, which is the only source
      of information about device sysfs nodes.
      
      Add `num_migrated' description, since there is no independent
      `num_migrated' sysfs node (and no corresponding sysfs-block-zram entry),
      it will be exported via zram<id>/mm_stat file.
      
      At this point we can provide minimal description, because sysfs-block-zram
      still contains detailed information.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: use generic start/end io accounting · 8811a942
      Sergey Senozhatsky committed
      Use bio generic_start_io_acct() and generic_end_io_acct() to account
      device's block layer statistics.  This will let users monitor zram
      activities using sysstat and similar packages/tools.
      
      Apart from the usual per-stat sysfs attr, zram IO stats are now also
      available in '/sys/block/zram<id>/stat' and '/proc/diskstats' files.
      
      We will slowly get rid of per-stat sysfs files.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: move compact_store() to sysfs functions area · c72c6160
      Sergey Senozhatsky committed
      A cosmetic change.  We have a new code layout and keep zram per-device
      sysfs store and show functions in one place.  Move compact_store() to that
      handlers block to conform to current layout.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: remove `num_migrated' device attr · 10447b60
      Sergey Senozhatsky committed
      This patch introduces rework to zram stats.  We have per-stat sysfs nodes,
      and it makes things a bit hard to use in user space: it doesn't give an
      immediate stats 'snapshot', it requires user space to use more syscalls -
      open, read, close for every stat file, with appropriate error checks on
      every step, etc.
      
      First, zram now accounts block layer statistics, available in
      /sys/block/zram<id>/stat and /proc/diskstats files.  So some new stats are
      available (see Documentation/block/stat.txt), besides, zram's activities
      now can be monitored by sysstat's iostat or similar tools.
      
      Example:
      cat /sys/block/zram0/stat
      248     0    1984    0   251029     0  2008232   5120   0   5116   5116
      
      Second, group the nodes currently exported on a per-stat basis into two
      categories (files):
      
      -- zram<id>/io_stat
      accumulates device's IO stats, that are not accounted by block layer,
      and contains:
              failed_reads
              failed_writes
              invalid_io
              notify_free
      
      Example:
      cat /sys/block/zram0/io_stat
      0        0        0   652572
      
      -- zram<id>/mm_stat
      accumulates zram mm stats and contains:
              orig_data_size
              compr_data_size
              mem_used_total
              mem_limit
              mem_used_max
              zero_pages
              num_migrated
      
      Example:
      cat /sys/block/zram0/mm_stat
      434634752 270288572 279158784        0 579895296    15060        0
      
      per-stat sysfs nodes are now considered to be deprecated and we plan to
      remove them (and clean up some of the existing stat code) in two years (as
      of now, there is no warning printed to syslog about deprecated stats being
      used).  User space is advised to use the above mentioned 3 files.
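A minimal sketch of how user space might read the three files in one pass each. The io_stat and mm_stat field orders come from the lists above; the names used for the block layer columns are labels chosen here for the fields described in Documentation/block/stat.txt, and `parse_stat_line` is a hypothetical helper. The sample lines are the example outputs shown in this message.

```python
from typing import Dict, Sequence

# Labels for the columns of /sys/block/<dev>/stat (names chosen for this sketch).
BLOCK_STAT = (
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
)
# Field order as listed in the commit message.
IO_STAT = ("failed_reads", "failed_writes", "invalid_io", "notify_free")
MM_STAT = (
    "orig_data_size", "compr_data_size", "mem_used_total", "mem_limit",
    "mem_used_max", "zero_pages", "num_migrated",
)


def parse_stat_line(line: str, fields: Sequence[str]) -> Dict[str, int]:
    """Map one whitespace-separated stat line onto the given field names."""
    values = [int(tok) for tok in line.split()]
    if len(values) < len(fields):
        raise ValueError(
            f"expected at least {len(fields)} values, got {len(values)}"
        )
    # Assumption of this sketch: extra trailing columns, if any, are ignored.
    return dict(zip(fields, values))


stat = parse_stat_line(
    "248     0    1984    0   251029     0  2008232   5120   0   5116   5116",
    BLOCK_STAT,
)
io = parse_stat_line("0        0        0   652572", IO_STAT)
mm = parse_stat_line(
    "434634752 270288572 279158784        0 579895296    15060        0",
    MM_STAT,
)

print(stat["write_ios"], io["notify_free"])
# Compression ratio derived from a single read of mm_stat.
print(round(mm["orig_data_size"] / mm["compr_data_size"], 2))
```

Each file needs only one open/read/close cycle, and all values on the line belong to the same snapshot, which is the usability point made above.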
      
      This patch (of 7):
      
      Remove sysfs `num_migrated' attribute.  We are moving away from per-stat
      device attrs towards 3 stat files that will accumulate io and mm stats in
      a format similar to block layer statistics in /sys/block/<dev>/stat.  That
      will be easier to use in user space, and reduce the number of syscalls
      needed to read zram device statistics.
      
      `num_migrated' will return in the zram<id>/mm_stat file.
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: zsmalloc documentation · d02be50d
      Minchan Kim committed
      Create zsmalloc doc which explains design concept and stat information.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: add fullness into stat · 248ca1b0
      Minchan Kim committed
      While investigating compaction, fullness information for each class is
      helpful: it shows more clearly how well compaction works on each size
      class.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: record handle in page->private for huge object · 7b60a685
      Minchan Kim committed
      We store handle on header of each allocated object so it increases the
      size of each object by sizeof(unsigned long).
      
      If zram stores 4096 bytes to zsmalloc(ie, bad compression), zsmalloc needs
      4104B-class to add handle.
      
      However, 4104B-class has 1-pages_per_zspage so wasted size by internal
      fragment is 8192 - 4104, which is terrible.
      
      So this patch records the handle in page->private on such huge object(ie,
      pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of header of each
      object so we could use 4096B-class, not 4104B-class.
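The arithmetic above can be checked with a few lines. This is only a sketch: it assumes a 4096-byte page and an 8-byte `unsigned long` (64-bit), and `class_size` and `wasted_bytes` are hypothetical helper names.

```python
PAGE_SIZE = 4096   # assumed page size
HANDLE_SIZE = 8    # assumed sizeof(unsigned long) on a 64-bit kernel


def class_size(payload: int, handle_in_header: bool) -> int:
    """Object size the allocator must serve for a given payload."""
    return payload + (HANDLE_SIZE if handle_in_header else 0)


def wasted_bytes(size: int) -> int:
    """Bytes lost when a single object of `size` occupies whole pages."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    return pages * PAGE_SIZE - size


before = class_size(4096, handle_in_header=True)    # handle stored in the header
after = class_size(4096, handle_in_header=False)    # handle kept in page->private
print(before, wasted_bytes(before))
print(after, wasted_bytes(after))
```

With the handle in the header, the 4096-byte payload becomes a 4104-byte object and the model reports 8192 - 4104 bytes of waste; with the handle moved out of the object, the payload fits a single page exactly.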
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zram: support compaction · 4e3ba878
      Minchan Kim committed
      Now that zsmalloc supports compaction, zram can use it.  For the first
      step, this patch exports compact knob via sysfs so user can do compaction
      via "echo 1 > /sys/block/zram0/compact".
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: adjust ZS_ALMOST_FULL · d3d07c92
      Minchan Kim committed
      Currently, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage has
      under 1/4 used objects (i.e., fullness_threshold_frac).  This can result
      in loose packing, since zsmalloc migrates only ZS_ALMOST_EMPTY zspages out.
      
      This patch changes the rule so that zsmalloc makes zspage which has above
      3/4 used object ZS_ALMOST_FULL so it could make tight packing.
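The new rule can be sketched as a small classifier. This follows the wording of the message (above 3/4 used means almost full) and is not the kernel function itself; the enum, the function name and the exact boundary handling are assumptions of this sketch.

```python
from enum import Enum


class Fullness(Enum):
    EMPTY = "ZS_EMPTY"
    ALMOST_EMPTY = "ZS_ALMOST_EMPTY"
    ALMOST_FULL = "ZS_ALMOST_FULL"
    FULL = "ZS_FULL"


FULLNESS_THRESHOLD_FRAC = 4  # quarters, as in fullness_threshold_frac


def fullness_group(inuse: int, max_objects: int) -> Fullness:
    """Classify a zspage by the number of objects in use."""
    if inuse == 0:
        return Fullness.EMPTY
    if inuse == max_objects:
        return Fullness.FULL
    # Strictly above 3/4 used counts as almost full; the rest is almost empty.
    if inuse * FULLNESS_THRESHOLD_FRAC > 3 * max_objects:
        return Fullness.ALMOST_FULL
    return Fullness.ALMOST_EMPTY


for used in (0, 2, 6, 7, 8):
    print(used, fullness_group(used, 8).value)
```

Under this rule every partially used zspage at or below 3/4 occupancy is a migration source, so compaction has more candidates to pack tightly than with the old 1/4 cut-off.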
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: support compaction · 312fcae2
      Minchan Kim committed
      This patch provides core functions for migration of zsmalloc.  The
      migration policy is simple, as follows.
      
      for each size class {
              while {
                      src_page = get zs_page from ZS_ALMOST_EMPTY
                      if (!src_page)
                              break;
                      dst_page = get zs_page from ZS_ALMOST_FULL
                      if (!dst_page)
                              dst_page = get zs_page from ZS_ALMOST_EMPTY
                      if (!dst_page)
                              break;
                      migrate(from src_page, to dst_page);
              }
      }
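The policy above can be simulated for a single size class. This is a toy model under assumptions: a zspage is reduced to a capacity and a use count, locking and handles are ignored, emptied pages are simply left in the list, and the almost-full threshold (above 3/4 used) is an assumption of the sketch. `ZsPage`, `pick`, `migrate` and `compact` are hypothetical names.

```python
from typing import List, Optional


class ZsPage:
    """A zspage reduced to its object capacity and number of used objects."""

    def __init__(self, capacity: int, used: int) -> None:
        self.capacity = capacity
        self.used = used

    def group(self) -> str:
        if self.used == 0:
            return "empty"
        if self.used == self.capacity:
            return "full"
        if self.used * 4 > 3 * self.capacity:
            return "almost_full"
        return "almost_empty"


def pick(pages: List[ZsPage], group: str,
         exclude: Optional[ZsPage] = None) -> Optional[ZsPage]:
    """Return the first page in the given fullness group, skipping `exclude`."""
    for page in pages:
        if page is not exclude and page.group() == group:
            return page
    return None


def migrate(src: ZsPage, dst: ZsPage) -> int:
    """Move objects from src to dst until src is empty or dst is full."""
    moved = min(src.used, dst.capacity - dst.used)
    src.used -= moved
    dst.used += moved
    return moved


def compact(pages: List[ZsPage]) -> int:
    """Apply the policy to one size class; return the number of moved objects."""
    total = 0
    while True:
        src = pick(pages, "almost_empty")
        if src is None:
            break
        dst = pick(pages, "almost_full", exclude=src)
        if dst is None:
            dst = pick(pages, "almost_empty", exclude=src)
        if dst is None:
            break
        total += migrate(src, dst)
    return total


pages = [ZsPage(8, 1), ZsPage(8, 7), ZsPage(8, 2), ZsPage(8, 6)]
moved = compact(pages)
print(moved, [page.used for page in pages])
```

The loop terminates because every migration either empties the source page or fills the destination page, and neither kind of page is selected again.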
      
      For migration, we need to identify which objects in zspage are allocated
      to migrate them out.  We could know it by iterating of freed objects in a
      zspage because first_page of zspage keeps free objects singly-linked list
      but it's not efficient.  Instead, this patch adds a tag(ie,
      OBJ_ALLOCATED_TAG) in header of each object(ie, handle) so we could check
      whether the object is allocated easily.
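A sketch of the tagging idea, assuming the tag is the lowest bit of the header word and that handles are at least 2-byte aligned so that bit is otherwise unused. The value of `OBJ_ALLOCATED_TAG`, the helper names and the sample values are assumptions of this example, not taken from the patch.

```python
OBJ_ALLOCATED_TAG = 1  # assumed: lowest bit of the object header word


def make_header(handle: int) -> int:
    """Header of an allocated object: its handle with the tag bit set."""
    if handle & OBJ_ALLOCATED_TAG:
        raise ValueError("handle must leave the tag bit clear")
    return handle | OBJ_ALLOCATED_TAG


def is_allocated(header: int) -> bool:
    """An object is allocated iff the tag bit is set in its header."""
    return bool(header & OBJ_ALLOCATED_TAG)


def header_to_handle(header: int) -> int:
    """Strip the tag bit to recover the handle."""
    return header & ~OBJ_ALLOCATED_TAG


handle = 0xFFFF880012345678       # made-up, aligned handle value
header = make_header(handle)
free_link = 0x40                  # made-up free-list link with the tag clear
print(is_allocated(header), is_allocated(free_link))
print(hex(header_to_handle(header)))
```

Checking one bit per object lets migration walk a zspage in order instead of following the free list to work out which slots are in use.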
      
      This patch adds another status bit in the handle to synchronize between
      user access through zs_map_object and migration.  During migration, we
      cannot move objects that users are accessing, because data coherency
      between the old object and the new object must be preserved.
      
      [akpm@linux-foundation.org: zsmalloc.c needs sched.h for cond_resched()]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: factor out obj_[malloc|free] · c7806261
      Minchan Kim committed
      In a later patch, migration needs some parts of the functions in
      zs_malloc and zs_free, so this patch factors them out.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: decouple handle and object · 2e40e163
      Minchan Kim committed
      Recently, we started to use zram heavily and some issues
      popped up.
      
      1) external fragmentation
      
      I got a report from Juneho Choi that fork failed although there were
      plenty of free pages in the system.  His investigation revealed zram is
      one of the culprits causing heavy fragmentation, so there was no
      contiguous 16K page left for the pgd to fork on ARM.
      
      2) non-movable pages
      
      Another problem with zram is that, inherently, users want to use zram as
      swap on small-memory systems, so they use zRAM with CMA to use memory
      efficiently.  Unfortunately, it doesn't work well because zRAM
      cannot use CMA's movable pages unless it supports compaction.  I
      got several reports that OOM happened with zram although there was
      lots of swap space and free space in the CMA area.
      
      3) internal fragmentation
      
      zRAM has started to support a memory limit feature to limit memory usage
      and I sent a patchset(https://lkml.org/lkml/2014/9/21/148) for VM to be
      harmonized with zram-swap to stop anonymous page reclaim if zram consumed
      memory up to the limit although there are free space on the swap.  One
      problem for that direction is zram has no way to know any hole in memory
      space zsmalloc allocated by internal fragmentation so zram would regard
      swap is full although there are free space in zsmalloc.  For solving the
      issue, zram want to trigger compaction of zsmalloc before it decides full
      or not.
      
      This patchset is the first step toward addressing the above issues.  For
      that, it adds an indirection layer between handle and object location and
      supports manual compaction, to solve the third problem first of all.
      
      After this patchset got merged, next step is to make VM aware of zsmalloc
      compaction so that generic compaction will move zsmalloced-pages
      automatically in runtime.
      
      In my imaginary experiment(ie, high compress ratio data with heavy swap
      in/out on 8G zram-swap), data is as follows,
      
      Before =
      zram allocated object :      60212066 bytes
      zram total used:     140103680 bytes
      ratio:         42.98 percent
      MemFree:          840192 kB
      
      Compaction
      
      After =
      frag ratio after compaction
      zram allocated object :      60212066 bytes
      zram total used:      76185600 bytes
      ratio:         79.03 percent
      MemFree:          901932 kB
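The ratio lines above are the allocated object bytes divided by the total bytes zram used, as a percentage. A short check of the two figures, where `usage_ratio` is a hypothetical helper name:

```python
def usage_ratio(allocated_bytes: int, total_used_bytes: int) -> float:
    """Percentage of the memory used by zram that holds allocated objects."""
    return 100.0 * allocated_bytes / total_used_bytes


before = usage_ratio(60212066, 140103680)
after = usage_ratio(60212066, 76185600)
print(f"before: {before:.2f} percent")
print(f"after:  {after:.2f} percent")
# Memory returned by compaction while storing the same objects.
print(140103680 - 76185600, "bytes released")
```

The object bytes are identical before and after; only the total drops, which is why the ratio nearly doubles.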
      
      Juneho reported below in his real platform with small aging.
      So, I think the benefit would be bigger in real aging system
      for a long time.
      
      - frag_ratio increased 3% (ie, higher is better)
      - memfree increased about 6MB
      - In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased
      
      frag ratio after swap fragment
      used :        156677 kbytes
      total:        166092 kbytes
      frag_ratio :  94
      meminfo before compaction
      MemFree:           83724 kB
      Node 0, zone   Normal  13642   1364     57     10     61     17      9      5      4      0      0
      Node 0, zone  HighMem    425     29      1      0      0      0      0      0      0      0      0
      
      num_migrated :  23630
      compaction done
      
      frag ratio after compaction
      used :        156673 kbytes
      total:        160564 kbytes
      frag_ratio :  97
      meminfo after compaction
      MemFree:           89060 kB
      Node 0, zone   Normal  14076   1544     67     14     61     17      9      5      4      0      0
      Node 0, zone  HighMem    863     50      1      0      0      0      0      0      0      0      0
      
      This patchset adds more logics(about 480 lines) in zsmalloc but when I
      tested heavy swapin/out program, the regression for swapin/out speed is
      marginal because most of overheads were caused by compress/decompress and
      other MM reclaim stuff.
      
      This patch (of 7):
      
      Currently, handle of zsmalloc encodes object's location directly so it
      makes support of migration hard.
      
      This patch decouples handle and object via adding indirect layer.  For
      that, it allocates handle dynamically and returns it to user.  The handle
      is the address allocated by slab allocation so it's unique and we could
      keep object's location in the memory space allocated for handle.
      
      With it, we can change object's position without changing handle itself.
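The indirection can be modelled in a few lines. This is a sketch, not the patch: a dictionary stands in for the slab memory behind each handle, a counter stands in for slab addresses, and `HandleTable` with its methods is a hypothetical name.

```python
from typing import Dict, Tuple

Location = Tuple[int, int]  # (page number, object index), illustrative only


class HandleTable:
    """Handles stay stable while the locations they point to may change."""

    def __init__(self) -> None:
        self._slots: Dict[int, Location] = {}
        self._next_handle = 1  # stands in for a unique slab address

    def alloc(self, location: Location) -> int:
        """Create a handle and record where its object currently lives."""
        handle = self._next_handle
        self._next_handle += 1
        self._slots[handle] = location
        return handle

    def lookup(self, handle: int) -> Location:
        """Resolve a handle to the object's current location."""
        return self._slots[handle]

    def move(self, handle: int, new_location: Location) -> None:
        """Migration updates the slot; the handle the user holds is unchanged."""
        if handle not in self._slots:
            raise KeyError(handle)
        self._slots[handle] = new_location

    def free(self, handle: int) -> None:
        del self._slots[handle]


table = HandleTable()
h = table.alloc((3, 5))
table.move(h, (7, 0))      # object migrated to another zspage
print(h, table.lookup(h))
```

The user keeps passing the same handle before and after the move, which is what makes migration possible without cooperation from zram.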
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Juneho Choi <juno.choi@lge.com>
      Cc: Gunho Lee <gunho.lee@lge.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction.c: fix "suitable_migration_target() unused" warning · 018e9a49
      Andrew Morton committed
      mm/compaction.c:250:13: warning: 'suitable_migration_target' defined but not used [-Wunused-function]
      Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: unify ext2/4_{dax,}_file_operations · be64f884
      Boaz Harrosh committed
      The original dax patchset split the ext2/4_file_operations because of the
      two NULL splice_read/splice_write in the dax case.
      
      In the vfs if splice_read/splice_write are NULL we then call
      default_splice_read/write.
      
      What we do here is make generic_file_splice_read aware of IS_DAX() so the
      original ext2/4_file_operations can be used as is.
      
      For write it appears that iter_file_splice_write is just fine.  It uses
      the regular f_op->write(file,..) or new_sync_write(file, ...).
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: use pfn_mkwrite to update c/mtime + freeze protection · 0e3b210c
      Boaz Harrosh committed
      From: Yigal Korman <yigal@plexistor.com>
      
      [v1]
      Without this patch, c/mtime is not updated correctly when mmap'ed page is
      first read from and then written to.
      
      A new xfstest is submitted for testing this (generic/080)
      
      [v2]
      Jan Kara has pointed out that if we add the
      sb_start/end_pagefault pair in the new pfn_mkwrite we
      are then fixing another bug where: A user could start
      writing to the page while filesystem is frozen.
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP · dd906184
      Boaz Harrosh committed
      This will allow FS that uses VM_PFNMAP | VM_MIXEDMAP (no page structs) to
      get notified when access is a write to a read-only PFN.
      
      This can happen if we mmap() a file then first mmap-read from it to
      page-in a read-only PFN, then we mmap-write to the same page.
      
      We need this functionality to fix a DAX bug, where in the scenario above
      we fail to set ctime/mtime though we modified the file.  An xfstest is
      attached to this patchset that shows the failure and the fix.  (A DAX
      patch will follow)
      
      This functionality is extra important for us, because upon dirtying of a
      pmem page we also want to RDMA the page to a remote cluster node.
      
      We define a new pfn_mkwrite and do not reuse page_mkwrite because
        1 - The name ;-)
        2 - But mainly because it would take a very long and tedious
            audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
            users. To make sure they do not now CRASH. For example current
            DAX code (which this is for) would crash.
            If we would want to reuse page_mkwrite, We will need to first
            patch all users, so to not-crash-on-no-page. Then enable this
            patch. But even if I did that I would not sleep so well at night.
            Adding a new vector is the safest thing to do, and is not that
            expensive. an extra pointer at a static function vector per driver.
            Also the new vector is better for performance, because else we
            Will call all current Kernel vectors, so to:
              check-ha-no-page-do-nothing and return.
      
      No need to call it from do_shared_fault because do_wp_page is called to
      change pte permissions anyway.
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory: also print a_ops->readpage in print_bad_pte() · 2682582a
      Konstantin Khlebnikov committed
      A lot of filesystems use generic_file_mmap() and filemap_fault(),
      f_op->mmap and vm_ops->fault aren't enough to identify filesystem.
      
      This prints file name, vm_ops->fault, f_op->mmap and a_ops->readpage
      (which is almost always implemented and filesystem-specific).
      
      Example:
      
      [   23.676410] BUG: Bad page map in process sh  pte:1b7e6025 pmd:19bbd067
      [   23.676887] page:ffffea00006df980 count:4 mapcount:1 mapping:ffff8800196426c0 index:0x97
      [   23.677481] flags: 0x10000000000000c(referenced|uptodate)
      [   23.677896] page dumped because: bad pte
      [   23.678205] addr:00007f52fcb17000 vm_flags:00000075 anon_vma:          (null) mapping:ffff8800196426c0 index:97
      [   23.678922] file:libc-2.19.so fault:filemap_fault mmap:generic_file_readonly_mmap readpage:v9fs_vfs_readpage
      
      [akpm@linux-foundation.org: use pr_alert, per Kirill]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2682582a
    • A
      mm/mempool.c: kasan: poison mempool elements · 92393615
      Andrey Ryabinin committed
      Mempools keep allocated objects in reserve for situations in which an
      ordinary allocation may be impossible to satisfy.  These objects
      shouldn't be accessed before they leave the pool.
      
      This patch poisons elements when they enter the pool and unpoisons them
      when they leave it.  This lets KASan detect use-after-free of a
      mempool's elements.
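
      A minimal user-space sketch of the idea, under stated assumptions:
      the toy_* names, POOL_MIN and the boolean "poisoned" array are
      hypothetical. The array only stands in for KASan's shadow state and
      is not how KASan or mempool is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define POOL_MIN 4 /* hypothetical number of reserved elements */

struct toy_pool {
    bool poisoned[POOL_MIN]; /* stand-in for KASan shadow state */
    int free_idx[POOL_MIN];  /* stack of elements held in reserve */
    int curr_nr;             /* how many elements are in reserve */
};

/* An element enters the reserve: mark it inaccessible. */
void toy_pool_add(struct toy_pool *p, int idx)
{
    assert(p->curr_nr < POOL_MIN);
    p->poisoned[idx] = true;
    p->free_idx[p->curr_nr++] = idx;
}

/* An element leaves the reserve: make it accessible again.
 * Returns the element index, or -1 if the reserve is empty. */
int toy_pool_remove(struct toy_pool *p)
{
    if (p->curr_nr == 0)
        return -1;
    int idx = p->free_idx[--p->curr_nr];
    p->poisoned[idx] = false;
    return idx;
}

/* What a checker would ask before any load or store to an element. */
bool toy_access_ok(const struct toy_pool *p, int idx)
{
    return !p->poisoned[idx];
}

/* Start with every element sitting in the reserve, poisoned. */
void toy_pool_init(struct toy_pool *p)
{
    memset(p, 0, sizeof(*p));
    for (int i = 0; i < POOL_MIN; i++)
        toy_pool_add(p, i);
}
```

      Any access to an element between toy_pool_add() and the matching
      toy_pool_remove() would be flagged, which is the use-after-free
      window the patch makes visible.
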
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Chernenkov <drcheren@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92393615
    • A
      mm/cma_debug.c: remove blank lines before DEFINE_SIMPLE_ATTRIBUTE() · bda6d330
      Andrew Morton committed
      Like EXPORT_SYMBOL(): the positioning communicates that the macro pertains
      to the immediately preceding function.
      
      Cc: Dmitry Safonov <d.safonov@partner.samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Stefan Strogin <stefan.strogin@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Vyacheslav Tyrtov <v.tyrtov@samsung.com>
      Cc: Aleksei Mateosian <a.mateosian@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bda6d330
    • D
      mm: cma: add functions to get region pages counters · 2e32b947
      Dmitry Safonov committed
      Add two functions that provide an interface to compute the used size
      and the size of the biggest free chunk in a CMA region.  Expose that
      information in debugfs.
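
      The two counters can be sketched over a plain allocation bitmap. This
      is a simplified illustration, assuming one bit per allocation unit;
      the names region_used() and region_maxchunk() are hypothetical and
      the code is not taken from mm/cma_debug.c.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Return 1 if allocation unit i is in use, 0 if it is free. */
static int unit_in_use(const unsigned long *bitmap, size_t i)
{
    return (int)((bitmap[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1UL);
}

/* Number of units currently allocated in the region. */
size_t region_used(const unsigned long *bitmap, size_t nbits)
{
    size_t used = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++)
        used += (size_t)unit_in_use(bitmap, i);
    return used;
}

/* Length, in units, of the longest run of free units. */
size_t region_maxchunk(const unsigned long *bitmap, size_t nbits)
{
    size_t best = 0, run = 0;

    assert(bitmap != NULL);
    for (size_t i = 0; i < nbits; i++) {
        if (unit_in_use(bitmap, i))
            run = 0;            /* a used unit ends the current free run */
        else if (++run > best)
            best = run;
    }
    return best;
}
```

      Both results are in allocation units; a caller would scale them by
      the unit size to report pages.
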
      
      [akpm@linux-foundation.org: move debug code from cma.c into cma_debug.c]
      [stefan.strogin@gmail.com: move code from cma_get_used() and cma_get_maxchunk() to cma_used_get() and cma_maxchunk_get()]
      Signed-off-by: Dmitry Safonov <d.safonov@partner.samsung.com>
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Vyacheslav Tyrtov <v.tyrtov@samsung.com>
      Cc: Aleksei Mateosian <a.mateosian@samsung.com>
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e32b947
    • K
      thp: cleanup khugepaged startup · 79553da2
      Kirill A. Shutemov committed
      A few trivial cleanups:
      
       - no need to call set_recommended_min_free_kbytes() from
         late_initcall() -- start_khugepaged() calls it;
      
       - no need to call set_recommended_min_free_kbytes() from
         start_khugepaged() if khugepaged is not started;
      
       - there isn't much point in running start_khugepaged() if we've just
         set transparent_hugepage_flags to zero;
      
       - start_khugepaged() is misnamed -- it is also used to stop the thread.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79553da2
    • K
      mm: uninline and cleanup page-mapping related helpers · e39155ea
      Kirill A. Shutemov committed
      The most-used page->mapping helper, page_mapping(), has already been
      uninlined.  Let's also uninline page_rmapping() and page_anon_vma().
      Depending on configuration, this saves around 400 bytes of text:
      
         text	   data	    bss	    dec	    hex	filename
       660318	  99254	 410000	1169572	 11d8a4	mm/built-in.o-before
       659854	  99254	 410000	1169108	 11d6d4	mm/built-in.o
      
      I also tried to make the code a bit cleaner.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e39155ea
    • S
      mm: cma: add trace events for CMA allocations and freeings · 99e8ea6c
      Stefan Strogin committed
      Add trace events for cma_alloc() and cma_release().
      
      The cma_alloc tracepoint is used for both successful and failed
      allocations; on allocation failure, pfn=-1UL is stored and printed.
      Signed-off-by: Stefan Strogin <stefan.strogin@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mpn@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Cc: Thierry Reding <treding@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99e8ea6c
    • B
      include/linux/mm.h: simplify flag check · cdd7875e
      Borislav Petkov committed
      Flip the flag test so that it is as simple as possible.  No functional
      change, just a small readability improvement:
      
      No code changed:
      
        # arch/x86/kernel/sys_x86_64.o:
      
         text    data     bss     dec     hex filename
         1551      24       0    1575     627 sys_x86_64.o.before
         1551      24       0    1575     627 sys_x86_64.o.after
      
      md5:
         70708d1b1ad35cc891118a69dc1a63f9  sys_x86_64.o.before.asm
         70708d1b1ad35cc891118a69dc1a63f9  sys_x86_64.o.after.asm
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdd7875e
    • A
      mm/memblock.c: add debug output for memblock_add() · 6a4055bc
      Alexander Kuleshov committed
      memblock_reserve() calls memblock_reserve_region(), which prints
      debugging information if 'memblock=debug' was passed on the command
      line.  This patch adds the same behaviour to memblock_add().
      
      [akpm@linux-foundation.org: s/memblock_memory/memblock_add/ in message]
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Emil Medve <Emilian.Medve@freescale.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a4055bc
    • N
      mm: hugetlb: cleanup using page_huge_active() · 7e1f049e
      Naoya Horiguchi committed
      Now that we have easy access to hugepages' activeness, the existing
      helpers that get this information can be cleaned up.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e1f049e
    • N
      mm: hugetlb: introduce page_huge_active · bcc54222
      Naoya Horiguchi committed
      Nothing prevents isolate_huge_page() from being called concurrently on
      the same hugepage, which can leave the victim hugepage in an invalid
      state and trigger a BUG_ON().
      
      The root problem is that struct page carries no easily accessible
      information about a hugepage's activeness.  Note that a hugepage's
      activeness just means being linked to hstate->hugepage_activelist,
      which is not the same as a normal page's activeness as represented by
      the PageActive flag.
      
      Normal pages are isolated by isolate_lru_page(), which prechecks PageLRU
      before isolation, so let's do the same for hugetlb with a new
      page_huge_active().
      
      set/clear_page_huge_active() should be called under hugetlb_lock.
      hugetlb_cow() and hugetlb_no_page() don't do this, which is justified
      because in these functions set_page_huge_active() is called right after
      the hugepage is allocated, when no other thread can try to isolate it.
      
      [akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/, make it return bool]
      [fengguang.wu@intel.com: set_page_huge_active() can be static]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bcc54222
    • N
      mm: don't call __page_cache_release for hugetlb · 822fc613
      Naoya Horiguchi committed
      __put_compound_page() calls __page_cache_release() to do some freeing
      work, but that work is meant for THPs, not for hugetlb.  This has been
      harmless so far because PageLRU is always cleared and page->mem_cgroup
      is always NULL for hugetlb pages.  But it's not correct and is
      potentially risky, so let's make the call conditional.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      822fc613
    • R
      mm/mmap.c: use while instead of if+goto · 9fcd1457
      Rasmus Villemoes committed
      The creators of the C language gave us the while keyword. Let's use
      that instead of synthesizing it from if+goto.
      
      Made possible by 6597d783 ("mm/mmap.c: replace find_vma_prepare()
      with clearer find_vma_links()").
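
      The transformation is mechanical. The sketch below is a generic
      illustration on a hypothetical binary search tree, not the code
      touched in mm/mmap.c: the same descent is written once with if+goto
      and once with while.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node used only for this illustration. */
struct node {
    int key;
    struct node *left, *right;
};

/* Loop synthesized from if + goto. */
const struct node *find_goto(const struct node *n, int key)
{
again:
    if (n && n->key != key) {
        n = key < n->key ? n->left : n->right;
        goto again;
    }
    return n;
}

/* The same control flow expressed with while. */
const struct node *find_while(const struct node *n, int key)
{
    while (n && n->key != key)
        n = key < n->key ? n->left : n->right;
    return n;
}

/* Both forms must agree on every lookup. */
void check_same(const struct node *root, int key)
{
    assert(find_goto(root, key) == find_while(root, key));
}
```

      The while form states the loop condition once at the top and removes
      the label, which is the readability gain the message refers to.
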
      
      [akpm@linux-foundation.org: fix 80-col overflows]
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fcd1457
    • D
      mm, selftests: test return value of munmap for MAP_HUGETLB memory · 215ba781
      David Rientjes committed
      When MAP_HUGETLB memory is unmapped, the length must be hugepage aligned,
      otherwise it fails with -EINVAL.
      
      All tests currently behave correctly, but it's better to explicitly test
      the return value for completeness and document the requirement, especially
      if users copy map_hugetlb.c as a sample implementation.
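
      One way a caller can satisfy the alignment requirement is to round
      the length up before calling munmap(). The helper below is a sketch,
      not code from map_hugetlb.c: the name hugepage_align_len() is
      hypothetical, and the hugepage size is assumed to be a power of two
      supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round len up to the next multiple of hugepage_size.
 * hugepage_size must be a non-zero power of two.
 */
size_t hugepage_align_len(size_t len, size_t hugepage_size)
{
    assert(hugepage_size != 0);
    assert((hugepage_size & (hugepage_size - 1)) == 0);
    return (len + hugepage_size - 1) & ~(hugepage_size - 1);
}
```

      The rounded length is what would be passed to munmap(), and the
      return value of munmap() should still be checked, as the test now
      does.
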
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      215ba781
    • D
      mm, doc: cleanup and clarify munmap behavior for hugetlb memory · 80d6b94b
      David Rientjes committed
      munmap(2) of hugetlb memory requires a length that is hugepage aligned,
      otherwise it may fail.  Add this to the documentation.
      
      This also cleans up the documentation and separates it into logical units:
      one part refers to MAP_HUGETLB and another part refers to requirements for
      shared memory segments.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80d6b94b
    • K
      thp: do not adjust zone water marks if khugepaged is not started · ae7efa50
      Kirill A. Shutemov committed
      set_recommended_min_free_kbytes() adjusts zone water marks to be suitable
      for khugepaged. We avoid doing this if khugepaged is disabled, but don't
      catch the case where khugepaged has failed to start.
      
      Let's address this by checking khugepaged_thread instead of
      khugepaged_enabled() in set_recommended_min_free_kbytes().
      It's NULL if the kernel thread is stopped or failed to start.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae7efa50