1. 17 12月, 2014 2 次提交
  2. 14 12月, 2014 38 次提交
    • P
      aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov 提交于
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is rings set up with io_setup() call. There's (almost) no problems
      in dumping them, there's a problem restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on wrong (old) address.
      This will result in a) aio ring remaining in memory and b) some other
      vma get unexpectedly unmapped.
      
      What do you think?
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Acked-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NBenjamin LaHaise <bcrl@kvack.org>
      e4a0d3e7
    • T
      mm/cma: make kmemleak ignore CMA regions · 620951e2
      Thierry Reding 提交于
      kmemleak will add allocations as objects to a pool.  The memory allocated
      for each object in this pool is periodically searched for pointers to
      other allocated objects.  This only works for memory that is mapped into
      the kernel's virtual address space, which happens not to be the case for
      most CMA regions.
      
      Furthermore, CMA regions are typically used to store data transferred to
      or from a device and therefore don't contain pointers to other objects.
      
      Without this, the kernel crashes on the first execution of the
      scan_gray_list() because it tries to access highmem.  Perhaps a more
      appropriate fix would be to reject any object that can't map to a kernel
      virtual address?
      
      [akpm@linux-foundation.org: add comment]
      [akpm@linux-foundation.org: fix comment, per Catalin]
      [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      620951e2
    • V
      slub: fix cpuset check in get_any_partial · dee2f8aa
      Vladimir Davydov 提交于
      If we fail to allocate from the current node's stock, we look for free
      objects on other nodes before calling the page allocator (see
      get_any_partial).  While checking other nodes we respect cpuset
      constraints by calling cpuset_zone_allowed.  We enforce hardwall check.
      As a result, we will fallback to the page allocator even if there are some
      pages cached on other nodes, but the current cpuset doesn't have them set.
       However, the page allocator uses softwall check for kernel allocations,
      so it may allocate from one of the other nodes in this case.
      
      Therefore we should use softwall cpuset check in get_any_partial to
      conform with the cpuset check in the page allocator.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dee2f8aa
    • V
      slab: fix cpuset check in fallback_alloc · 061d7074
      Vladimir Davydov 提交于
      fallback_alloc is called on kmalloc if the preferred node doesn't have
      free or partial slabs and there's no pages on the node's free list
      (GFP_THISNODE allocations fail).  Before invoking the reclaimer it tries
      to locate a free or partial slab on other allowed nodes' lists.  While
      iterating over the preferred node's zonelist it skips those zones which
      hardwall cpuset check returns false for.  That means that for a task bound
      to a specific node using cpusets fallback_alloc will always ignore free
      slabs on other nodes and go directly to the reclaimer, which, however, may
      allocate from other nodes if cpuset.mem_hardwall is unset (default).  As a
      result, we may get lists of free slabs grow without bounds on other nodes,
      which is bad, because inactive slabs are only evicted by cache_reap at a
      very slow rate and cannot be dropped forcefully.
      
      To reproduce the issue, run a process that will walk over a directory tree
      with lots of files inside a cpuset bound to a node that constantly
      experiences memory pressure.  Look at num_slabs vs active_slabs growth as
      reported by /proc/slabinfo.
      
      To avoid this we should use softwall cpuset check in fallback_alloc.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      061d7074
    • H
      mm/zbud: init user ops only when it is needed · 1dd61aa3
      Heesub Shin 提交于
      When zbud is initialized through the zpool wrapper, pool->ops which
      points to user-defined operations is always set regardless of whether it
      is specified from the upper layer. This causes zbud_reclaim_page() to
      iterate its loop for evicting pool pages out without any gain.
      
      This patch sets the user-defined ops only when it is needed, so that
      zbud_reclaim_page() can bail out the reclamation loop earlier if there
      is no user-defined operations specified.
      Signed-off-by: NHeesub Shin <heesub.shin@samsung.com>
      Acked-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sunae Seo <sunae.seo@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dd61aa3
    • M
      mm/zswap: delete unnecessary check before calling free_percpu() · 442cc432
      Markus Elfring 提交于
      free_percpu() tests whether its argument is NULL and then returns
      immediately.  Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
      Signed-off-by: NMarkus Elfring <elfring@users.sourceforge.net>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      442cc432
    • M
      mm/zswap: add __init to some functions in zswap · dd01d7d8
      Mahendran Ganesh 提交于
      zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create is only called by
      __init init_zswap()
      Signed-off-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd01d7d8
    • G
      mm/zsmalloc: allocate exactly size of struct zs_pool · 18136656
      Ganesh Mahendran 提交于
      In zs_create_pool(), we allocate memory more then sizeof(struct zs_pool)
        ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
      
      This patch allocate memory of exactly needed size.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18136656
    • G
      mm/zsmalloc: avoid duplicate assignment of prev_class · df8b5bb9
      Ganesh Mahendran 提交于
      In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times.
      And the prev_class only references to the previous size_class.  So we do
      not need unnecessary assignement.
      
      This patch assigns *prev_class* when a new size_class structure is
      allocated and uses prev_class to check whether the first class has been
      allocated.
      
      [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df8b5bb9
    • M
      mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE · 40f9fb8c
      Mahendran Ganesh 提交于
      I sent a patch [1] for unnecessary check in zsmalloc.  And Minchan Kim
      found zsmalloc even does not support allocating an obj with the size of
      ZS_MAX_ALLOC_SIZE in some situations.
      
      For example:
         In system with 64KB PAGE_SIZE and 32 bit of physical addr. Then:
         ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
            MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
         ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
         ZS_SIZE_CLASS_DELTA is 256 bytes
         So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
                                ZS_SIZE_CLASS_DELTA + 1
                             = 256
      
         In zs_create_pool(), the max size obj which can be allocated will be:
            ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312
      
         We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
         allocate objs with size ZS_MAX_ALLOC_SIZE(65536) which we promise upper
         users we can do.
      
       [1]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
       [2]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html
      
      This patch fixes this issue by dynamiclly calculating zs_size_classes when
      module is loaded, allocates buffer with size ZS_MAX_ALLOC_SIZE.  Then the
      max obj(size is ZS_MAX_ALLOC_SIZE) can be stored in it.
      
      [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
      Signed-off-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Suggested-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40f9fb8c
    • M
      zsmalloc: correct fragile [kmap|kunmap]_atomic use · af4ee5e9
      Minchan Kim 提交于
      The kunmap_atomic should use virtual address getting by kmap_atomic.
      However, some pieces of code in zsmalloc uses modified address, not the
      one got by kmap_atomic for kunmap_atomic.
      
      It's okay for working because zsmalloc modifies the address inner
      PAGE_SIZE bounday so it works with current kmap_atomic's implementation.
      But it's still fragile with potential changing of kmap_atomic so let's
      correct it.
      
      I got a subtle bug when I implemented a new feature of zsmalloc
      (compaction) due to a link's mishandling (the link was over page
      boundary).  Although it was totally my mistake, it took a while to find
      the cause because an unpredictable kmapped address was unmapped causing an
      almost random crash.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af4ee5e9
    • S
      zsmalloc: fix zs_init cpu notifier error handling · b1b00a5b
      Sergey Senozhatsky 提交于
      Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
      zpool_unregister_driver() from zs_init() if cpu notifier registration has
      failed, because error handling is performed before we register the driver
      via zpool_register_driver() call.
      
      Factor out cpu notifier registration and unregistration code and fix
      zs_init() error handling.
      
      link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
      [akpm@linux-foundation.org: squash bogus gcc warning]
      [akpm@linux-foundation.org: use __init and __exit]
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: NMahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b1b00a5b
    • J
      zsmalloc: merge size_class to reduce fragmentation · 9eec4cd5
      Joonsoo Kim 提交于
      zsmalloc has many size_classes to reduce fragmentation and they are in 16
      bytes unit, for example, 16, 32, 48, etc., if PAGE_SIZE is 4096.  And,
      zsmalloc has constraint that each zspage has 4 pages at maximum.
      
      In this situation, we can see interesting aspect.  Let's think about
      size_class for 1488, 1472, ..., 1376.  To prevent external fragmentation,
      they uses 4 pages per zspage and so all they can contain 11 objects at
      maximum.
      
      16384 (4096 * 4) = 1488 * 11 + remains
      16384 (4096 * 4) = 1472 * 11 + remains
      16384 (4096 * 4) = ...
      16384 (4096 * 4) = 1376 * 11 + remains
      
      It means that they have same characteristics and classification between
      them isn't needed.  If we use one size_class for them, we can reduce
      fragementation and save some memory since both the 1488 and 1472 sized
      classes can only fit 11 objects into 4 pages, and an object that's 1472
      bytes can fit into an object that's 1488 bytes, merging these classes to
      always use objects that are 1488 bytes will reduce the total number of
      size classes.  And reducing the total number of size classes reduces
      overall fragmentation, because a wider range of compressed pages can fit
      into a single size class, leaving less unused objects in each size class.
      
      For this purpose, this patch implement size_class merging.  If there is
      size_class that have same pages_per_zspage and same number of objects per
      zspage with previous size_class, we don't create new size_class.  Instead,
      we use previous, same characteristic size_class.  With this way, above
      example sizes (1488, 1472, ..., 1376) use just one size_class so we can
      get much more memory utilization.
      
      Below is result of my simple test.
      
      TEST ENV: EXT4 on zram, mount with discard option WORKLOAD: untar kernel
      source code, remove directory in descending order in size.  (drivers arch
      fs sound include net Documentation firmware kernel tools)
      
      Each line represents orig_data_size, compr_data_size, mem_used_total,
      fragmentation overhead (mem_used - compr_data_size) and overhead ratio
      (overhead to compr_data_size), respectively, after untar and remove
      operation is executed.
      
      * untar-nomerge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      525.88MB 199.16MB 210.23MB  11.08MB 5.56%
      288.32MB  97.43MB 105.63MB   8.20MB 8.41%
      177.32MB  61.12MB  69.40MB   8.28MB 13.55%
      146.47MB  47.32MB  56.10MB   8.78MB 18.55%
      124.16MB  38.85MB  48.41MB   9.55MB 24.58%
      103.93MB  31.68MB  40.93MB   9.25MB 29.21%
       84.34MB  22.86MB  32.72MB   9.86MB 43.13%
       66.87MB  14.83MB  23.83MB   9.00MB 60.70%
       60.67MB  11.11MB  18.60MB   7.49MB 67.48%
       55.86MB   8.83MB  16.61MB   7.77MB 88.03%
       53.32MB   8.01MB  15.32MB   7.31MB 91.24%
      
      * untar-merge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      526.23MB 199.18MB 209.81MB  10.64MB 5.34%
      288.68MB  97.45MB 104.08MB   6.63MB 6.80%
      177.68MB  61.14MB  66.93MB   5.79MB 9.47%
      146.83MB  47.34MB  52.79MB   5.45MB 11.51%
      124.52MB  38.87MB  44.30MB   5.43MB 13.96%
      104.29MB  31.70MB  36.83MB   5.13MB 16.19%
       84.70MB  22.88MB  27.92MB   5.04MB 22.04%
       67.11MB  14.83MB  19.26MB   4.43MB 29.86%
       60.82MB  11.10MB  14.90MB   3.79MB 34.17%
       55.90MB   8.82MB  12.61MB   3.79MB 42.97%
       53.32MB   8.01MB  11.73MB   3.73MB 46.53%
      
      As you can see above result, merged one has better utilization (overhead
      ratio, 5th column) and uses less memory (mem_used_total, 3rd column).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: <juno.choi@lge.com>
      Cc: "seungho1.park" <seungho1.park@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9eec4cd5
    • R
      mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate() · 70bc068c
      Rickard Strandqvist 提交于
      Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
      to the beginning of memcg_stat_show().
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: NRickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70bc068c
    • V
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov 提交于
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8135be5a
    • M
      mm/memcontrol.c: fix defined but not used compiler warning · ae6e71d3
      Michele Curti 提交于
      test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so
      move it into the compiler if statement
      
      [akpm@linux-foundation.org: clean up layout]
      Signed-off-by: NMichele Curti <michele.curti@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae6e71d3
    • M
      mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages · 441c228f
      Mel Gorman 提交于
      A random seek IO benchmark appeared to regress because of a change to
      readahead but the real problem was the benchmark.  To ensure the IO
      request accesssed disk, it used fadvise(FADV_DONTNEED) on a block boundary
      (512K) but the hint is ignored by the kernel.  This is correct but not
      necessarily obvious behaviour.  As much as I dislike comment patches, the
      explanation for this behaviour predates current git history.  Clarify why
      it behaves like this in case someone "fixes" fadvise or readahead for the
      wrong reasons.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      441c228f
    • D
      mm/vmalloc.c: fix memory ordering bug · 7e5b528b
      Dmitry Vyukov 提交于
      Read memory barriers must follow the read operations.
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e5b528b
    • O
      oom: kill the insufficient and no longer needed PT_TRACE_EXIT check · 6a2d5679
      Oleg Nesterov 提交于
      After the previous patch we can remove the PT_TRACE_EXIT check in
      oom_scan_process_thread(), it was added to handle the case when the
      coredumping was "frozen" by ptrace, but it doesn't really work.  If
      nothing else, we would need to check all threads which could share the
      same ->mm to make it more or less correct.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a2d5679
    • O
      oom: don't assume that a coredumping thread will exit soon · d003f371
      Oleg Nesterov 提交于
      oom_kill.c assumes that PF_EXITING task should exit and free the memory
      soon.  This is wrong in many ways and one important case is the coredump.
      A task can sleep in exit_mm() "forever" while the coredumping sub-thread
      can need more memory.
      
      Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
      we add the new trivial helper for that.
      
      Note: this is only the first step, this patch doesn't try to solve other
      problems.  The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
      participate in coredump after it was already observed in PF_EXITING state,
      so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
      fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
      out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
       And even the name/usage of the new helper is confusing, an exiting thread
      can only free its ->mm if it is the only/last task in thread group.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d003f371
    • Z
      mm: remove the highmem zones' memmap in the highmem zone · ba914f48
      Zhong Hongbo 提交于
      Since 01cefaef ("mm: provide more accurate estimation
      of pages occupied by memmap") allocate the pages from lowmem for the
      highmem zones' memmap. So It is not need to reserver the memmap's for
      the highmem.
      
      A 2G DDR3 for the arm platform:
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 528 pages used for memmap
        HighMem zone: 67584 pages, LIFO batch:15
      
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 67584 pages, LIFO batch:15
      Signed-off-by: NHongbo Zhong <hongbo.zhong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba914f48
    • H
      mm: unmapped page migration avoid unmap+remap overhead · 2ebba6b7
      Hugh Dickins 提交于
      Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
      created for use on pages almost certainly mapped into userspace.  But
      nowadays compaction often applies them to unmapped page cache pages: which
      may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
      try_to_unmap_file() makes no preliminary page_mapped() check.
      
      Now check page_mapped() in __unmap_and_move(); and avoid repeating the
      same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
      never inserted any.
      
      (The PageAnon(page) comment blocks now look even sillier than before, but
      clean that up on some other occasion.  And note in passing that
      try_to_unmap_one() does not use a migration entry when PageSwapCache, so
      remove_migration_ptes() will then not update that swap entry to newpage
      pte: not a big deal, but something else to clean up later.)
      
      Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
      conversion to i_mmap_rwsem, that "The biggest winner of these changes is
      migration": a part of the reason might be all of that unnecessary taking
      of i_mmap_mutex in page migration; and it's rather a shame that I didn't
      get around to sending this patch in before his - this one is much less
      useful after Davidlohr's conversion to rwsem, but still good.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ebba6b7
    • J
      mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner 提交于
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • D
      mm,vmacache: count number of system-wide flushes · f5f302e2
      Davidlohr Bueso 提交于
      These flushes deal with sequence number overflows, such as for long lived
      threads.  These are rare, but interesting from a debugging PoV.  As such,
      display the number of flushes when vmacache debugging is enabled.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5f302e2
    • J
      mm/page_owner: correct owner information for early allocated pages · 61cf5feb
      Joonsoo Kim 提交于
      Extended memory to store page owner information is initialized some time
      later than that page allocator starts.  Until initialization, many pages
      can be allocated and they have no owner information.  This make debugging
      using page owner harder, so some fixup will be helpful.
      
      This patch fixes up this situation by setting fake owner information
      immediately after page extension is initialized.  Information doesn't tell
      the right owner, but, at least, it can tell whether page is allocated or
      not, more correctly.
      
      On my testing, this patch catches 13343 early allocated pages, although
      they are mostly allocated from page extension feature.  Anyway, after
      then, there is no page left that it is allocated and has no page owner
      flag.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61cf5feb
    • J
      mm/page_owner: keep track of page owners · 48c96a36
      Joonsoo Kim 提交于
      This is the page owner tracking code which is introduced so far ago.  It
      is resident on Andrew's tree, though, nobody tried to upstream so it
      remain as is.  Our company uses this feature actively to debug memory leak
      or to find a memory hogger so I decide to upstream this feature.
      
      This functionality help us to know who allocates the page.  When
      allocating a page, we store some information about allocation in extra
      memory.  Later, if we need to know status of all pages, we can get and
      analyze it from this stored information.
      
      In previous version of this feature, extra memory is statically defined in
      struct page, but, in this version, extra memory is allocated outside of
      struct page.  It enables us to turn on/off this feature at boottime
      without considerable memory waste.
      
      Although we already have tracepoint for tracing page allocation/free,
      using it to analyze page owner is rather complex.  We need to enlarge the
      trace buffer for preventing overlapping until userspace program launched.
      And, launched program continually dump out the trace buffer for later
      analysis and it would change system behaviour with more possibility rather
      than just keeping it in memory, so bad for debug.
      
      Moreover, we can use page_owner feature further for various purposes.  For
      example, we can use it for fragmentation statistics implemented in this
      patch.  And, I also plan to implement some CMA failure debugging feature
      using this interface.
      
      I'd like to give the credit for all developers contributed this feature,
      but, it's not easy because I don't know exact history.  Sorry about that.
      Below is people who has "Signed-off-by" in the patches in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c96a36
    • J
      mm/nommu: use alloc_pages_exact() rather than its own implementation · dbc8358c
      Joonsoo Kim 提交于
      do_mmap_private() in nommu.c try to allocate physically contiguous pages
      with arbitrary size in some cases and we now have good abstract function
      to do exactly same thing, alloc_pages_exact().  So, change to use it.
      
      There is no functional change.  This is the preparation step for support
      page owner feature accurately.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbc8358c
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim 提交于
      Now, we have prepared to avoid using debug-pagealloc in boottime.  So
      introduce new kernel-parameter to disable debug-pagealloc in boottime, and
      makes related functions to be disabled in this case.
      
      Only non-intuitive part is change of guard page functions.  Because guard
      page is effective only if debug-pagealloc is enabled, turning off
      according to debug-pagealloc is reasonable thing to do.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      031bc574
    • J
      mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Joonsoo Kim 提交于
      Until now, debug-pagealloc needs extra flags in struct page, so we need to
      recompile whole source code when we decide to use it.  This is really
      painful, because it takes some time to recompile and sometimes rebuild is
      not possible due to third party module depending on struct page.  So, we
      can't use this good feature in many cases.
      
      Now, we have the page extension feature that allows us to insert extra
      flags to outside of struct page.  This gets rid of third party module
      issue mentioned above.  And, this allows us to determine if we need extra
      memory for this page extension in boottime.  With these property, we can
      avoid using debug-pagealloc in boottime with low computational overhead in
      the kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our
      development process greatly.
      
      This patch is the preparation step to achive above goal.  debug-pagealloc
      originally uses extra field of struct page, but, after this patch, it will
      use field of struct page_ext.  Because memory for page_ext is allocated
      later than initialization of page allocator in CONFIG_SPARSEMEM, we should
      disable debug-pagealloc feature temporarily until initialization of
      page_ext.  This patch implements this.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • J
      mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim 提交于
      When we debug something, we'd like to insert some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But,
      this has drawbacks.  First, it requires re-compile.  This makes us
      hesitate to use the powerful debug feature so development process is
      slowed down.  And, second, sometimes it is impossible to rebuild the
      kernel due to third party module dependency.  At third, system behaviour
      would be largely different after re-compile, because it changes size of
      struct page greatly and this structure is accessed by every part of
      kernel.  Keeping this as it is would be better to reproduce errornous
      situation.
      
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature into the kernel in default and can avoid rebuild and
      solve related problems.
      
      Until now, memcg uses this technique.  But, now, memcg decides to embed
      their variable to struct page itself and it's code to extend struct page
      has been removed.  I'd like to use this code to develop debug feature, so
      this patch resurrect it.
      
      To help these things to work well, this patch introduces two callbacks for
      clients.  One is the need callback which is mandatory if user wants to
      avoid useless memory allocation at boot-time.  The other is optional, init
      callback, which is used to do proper initialization after memory is
      allocated.  Detailed explanation about purpose of these functions is in
      code comment.  Please refer it.
      
      Others are completely same with previous extension code in memcg.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • Z
      mm/memcontrol.c: remove the unused arg in __memcg_kmem_get_cache() · 056b7cce
      Zhang Zhen 提交于
      The gfp was passed in but never used in this function.
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      056b7cce
    • J
      mm: export find_extend_vma() and handle_mm_fault() for driver use · e1d6d01a
      Jesse Barnes 提交于
      This lets drivers like the AMD IOMMUv2 driver handle faults a bit more
      simply, rather than doing tricks with page refs and get_user_pages().
      Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
      Cc: Oded Gabbay <oded.gabbay@amd.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1d6d01a
    • L
      hugetlb: hugetlb_register_all_nodes(): add __init marker · 7d9ca000
      Luiz Capitulino 提交于
      This function is only called during initialization.
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d9ca000
    • L
      hugetlb: alloc_bootmem_huge_page(): use IS_ALIGNED() · df994ead
      Luiz Capitulino 提交于
      No reason to duplicate the code of an existing macro.
      Signed-off-by: NLuiz Capitulino <lcapitulino@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df994ead
    • V
      memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29
      Vladimir Davydov 提交于
      It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
      the task_struct.
      
      Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
      set/clear the flag inline.  Regarding the overwhelming comment to the
      helpers, which is removed by this patch too, we already have a compact yet
      accurate explanation in memcg_schedule_cache_create, no need in yet
      another one.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f185c29
    • V
      memcg: only check memcg_kmem_skip_account in __memcg_kmem_get_cache · 4e701d7b
      Vladimir Davydov 提交于
      __memcg_kmem_get_cache can recurse if it calls kmalloc (which it does if
      the cgroup's kmem cache doesn't exist), because kmalloc may call
      __memcg_kmem_get_cache internally again.  To avoid the recursion, we use
      the task_struct->memcg_kmem_skip_account flag.
      
      However, there's no need checking the flag in memcg_kmem_newpage_charge,
      because there's no way how this function could result in recursion, if
      called from memcg_kmem_get_cache.  So let's remove the redundant code.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e701d7b
    • V
      memcg: zap kmem_account_flags · 900a38f0
      Vladimir Davydov 提交于
      The only such flag is KMEM_ACCOUNTED_ACTIVE, but it's set iff
      mem_cgroup->kmemcg_id is initialized, so we can check kmemcg_id instead of
      having a separate flags field.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      900a38f0
    • W
      mm: mincore: add hwpoison page handle · c313dc5d
      Weijie Yang 提交于
      When the encountered pte is a swap entry, the current code handles two
      cases: migration and normal swapentry, but we have a third case: hwpoison
      page.
      
      This patch adds hwpoison page handle, consider hwpoison page incore as
      same as migration.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c313dc5d