1. 09 Jan, 2015 6 commits
    • mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process being killed · 9e5e3661
      Vlastimil Babka committed
      Charles Shirron and Paul Cassella from Cray Inc have reported kswapd
      stuck in a busy loop with nothing left to balance, but
      kswapd_try_to_sleep() failing to sleep.  Their analysis found the cause
      to be a combination of several factors:
      
      1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait
      
      2. The process has been killed (by OOM in this case), but has not yet been
         scheduled to remove itself from the waitqueue and die.
      
      3. kswapd checks for throttled processes in prepare_kswapd_sleep():
      
              if (waitqueue_active(&pgdat->pfmemalloc_wait)) {
                      wake_up(&pgdat->pfmemalloc_wait);
                      return false; // kswapd will not go to sleep
              }
      
         However, for a process that was already killed, wake_up() does not remove
         the process from the waitqueue, since try_to_wake_up() checks its state
         first and returns false when the process is no longer waiting.
      
      4. kswapd is running on the only CPU that the process is allowed to run
         on (through cpus_allowed, or possibly a single-cpu system).
      
      5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd
         encounters no voluntary preemption points and repeatedly fails
         prepare_kswapd_sleep(), blocking the process from running and removing
         itself from the waitqueue, which would let kswapd sleep.
      
      So, the source of the problem is that we prevent kswapd from going to
      sleep as long as there are processes waiting on the pfmemalloc_wait queue,
      and a process waiting on a queue is guaranteed to be removed from the
      queue only when it gets scheduled.  This was done to make sure that no
      process is left sleeping on pfmemalloc_wait when kswapd itself goes to
      sleep.
      
      However, it isn't necessary to postpone kswapd sleep until the
      pfmemalloc_wait queue actually empties.  To prevent processes from being
      left sleeping, it's actually enough to guarantee that all processes
      waiting on pfmemalloc_wait queue have been woken up by the time we put
      kswapd to sleep.
      
      This patch therefore fixes this issue by substituting 'wake_up' with
      'wake_up_all' and removing 'return false' in the code snippet from
      prepare_kswapd_sleep() above.  Note that if any process puts itself in
      the queue after this waitqueue_active() check, or after the wake up
      itself, it means that the process will also wake up kswapd - and since
      we are under prepare_to_wait(), the wake up won't be missed.  Also we
      update the comment in prepare_kswapd_sleep() to describe more clearly
      the races it is preventing.
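      The livelock and the fix can be replayed in a small userspace toy
      model.  This is only a sketch: toy_waiter, toy_wake() and the
      toy_prepare_sleep_*() helpers are hypothetical names, not kernel
      APIs, and the zone-balance checks are assumed to pass.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_waiter {
    bool queued;    /* still linked on the wait queue */
    bool runnable;  /* already woken (e.g. killed), but not yet scheduled */
};

/* Models a wake-up for one waiter: the task is marked runnable, but it
 * only unlinks itself from the queue once it actually gets to run. */
static void toy_wake(struct toy_waiter *w)
{
    w->runnable = true;
}

/* Old check: refuse to sleep while anything is still queued. */
static bool toy_prepare_sleep_old(struct toy_waiter *w)
{
    if (w->queued) {
        toy_wake(w);
        return false;
    }
    return true; /* balance checks omitted, assumed to pass */
}

/* New check: wake the waiters and carry on to the balance checks. */
static bool toy_prepare_sleep_new(struct toy_waiter *w)
{
    if (w->queued)
        toy_wake(w);
    return true; /* balance checks omitted, assumed to pass */
}

/* Retry the check while the killed waiter never gets CPU time.
 * Returns the attempt on which sleeping was allowed, or -1 if it never was. */
static int toy_attempts_until_sleep(bool (*prepare)(struct toy_waiter *),
                                    int max_attempts)
{
    struct toy_waiter killed = { .queued = true, .runnable = true };

    assert(max_attempts > 0);
    for (int i = 1; i <= max_attempts; i++)
        if (prepare(&killed))
            return i;
    return -1;
}
```

      With the old check the caller never gets to sleep as long as the
      waiter stays queued; with the new check the first attempt succeeds,
      which lets the waiter run and dequeue itself.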
      
      Fixes: 5515061d ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>  [3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix destination cgroup leak on task charges migration · 4bdfc1c4
      Vladimir Davydov committed
      We are supposed to take one css reference per each memory page and per
      each swap entry accounted to a memory cgroup.  However, during task
      charges migration we take a reference to the destination cgroup twice
      per each swap entry: first in mem_cgroup_do_precharge()->try_charge()
      and then in mem_cgroup_move_swap_account(), permanently leaking the
      destination cgroup.
      
      The hunk taking the second reference seems to be a leftover from the
      pre-00501b53 ("mm: memcontrol: rewrite charge API") era.  Remove it
      to fix the leak.
      
      Fixes: e8ea14cc ("mm: memcontrol: take a css reference for each charged page")
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: switch soft limit default back to infinity · 24d404dc
      Johannes Weiner committed
      Commit 3e32cb2e ("mm: memcontrol: lockless page counters")
      accidentally switched the soft limit default from infinity to zero,
      which turns all memcgs with even a single page into soft limit excessors
      and engages soft limit reclaim on all of them during global memory
      pressure.  This makes global reclaim generally more aggressive, but also
      inverts the meaning of existing soft limit configurations where unset
      soft limits are usually more generous than set ones.
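      The effect of the default value can be illustrated with a toy
      comparison.  This is a sketch only: the usage figures and
      toy_count_excessors() are made up for illustration, and "infinity"
      is modelled as ULONG_MAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-cgroup usage figures, in pages. */
static const unsigned long toy_usage[] = { 1, 4096, 0, 250000 };

/* Count the groups whose usage exceeds an unset (default) soft limit. */
static size_t toy_count_excessors(unsigned long default_soft_limit)
{
    size_t n = sizeof(toy_usage) / sizeof(toy_usage[0]);
    size_t excessors = 0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        if (toy_usage[i] > default_soft_limit)
            excessors++;
    return excessors;
}
```

      With a default of zero, every group holding at least one page counts
      as an excessor; with a default of ULONG_MAX, none does.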

      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/debug_pagealloc: remove obsolete Kconfig options · 70ecb3cb
      Joonsoo Kim committed
      These are obsolete since commit e30825f1 ("mm/debug-pagealloc:
      prepare boottime configurable") was merged.  So remove them.
      
      [pebolle@tiscali.nl: find obsolete Kconfig options]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: protect set_page_dirty() from ongoing truncation · 2d6d7f98
      Johannes Weiner committed
      Tejun, while reviewing the code, spotted the following race condition
      between the dirtying and truncation of a page:
      
      __set_page_dirty_nobuffers()       __delete_from_page_cache()
        if (TestSetPageDirty(page))
                                           page->mapping = NULL
                                           if (PageDirty())
                                             dec_zone_page_state(page, NR_FILE_DIRTY);
                                             dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
          if (page->mapping)
            account_page_dirtied(page)
              __inc_zone_page_state(page, NR_FILE_DIRTY);
              __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
      
      which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
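      The interleaving above can be replayed deterministically in
      userspace.  This is a sketch under stated assumptions: toy_page,
      toy_truncate() and toy_replay() are hypothetical stand-ins, and a
      single counter models NR_FILE_DIRTY.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a page: only the two fields the race depends on. */
struct toy_page {
    bool dirty;
    const void *mapping;
};

/* Models the truncation side: detach the page, then undo dirty accounting. */
static void toy_truncate(struct toy_page *page, long *nr_file_dirty)
{
    assert(page != NULL && nr_file_dirty != NULL);
    page->mapping = NULL;
    if (page->dirty)
        (*nr_file_dirty)--;
}

/* Replays the dirtier, with truncation either interleaved between the
 * test-and-set and the accounting, or run strictly afterwards.
 * Returns the final value of the modelled dirty counter. */
static long toy_replay(bool interleaved)
{
    static const int backing = 0;
    struct toy_page page = { .dirty = false, .mapping = &backing };
    long nr_file_dirty = 0;
    bool was_dirty;

    /* Dirtier: test-and-set of the dirty bit. */
    was_dirty = page.dirty;
    page.dirty = true;

    if (interleaved)
        toy_truncate(&page, &nr_file_dirty);

    /* Dirtier: account the page only if it is still attached. */
    if (!was_dirty && page.mapping)
        nr_file_dirty++;

    if (!interleaved)
        toy_truncate(&page, &nr_file_dirty);

    return nr_file_dirty;
}
```

      In the interleaved order the counter ends at -1, because the
      decrement happens without a matching increment; when the dirtier
      finishes first, the counter returns to 0.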
      
      Dirtiers usually lock out truncation, either by holding the page lock
      directly, or in case of zap_pte_range(), by pinning the mapcount with
      the page table lock held.  The notable exception to this rule, though,
      is do_wp_page(), for which this race exists.  However, do_wp_page()
      already waits for a locked page to unlock before setting the dirty bit,
      in order to prevent a race where clear_page_dirty() misses the page bit
      in the presence of dirty ptes.  Upgrade that wait to a fully locked
      set_page_dirty() to also cover the situation explained above.
      
      Afterwards, the code in set_page_dirty() dealing with a truncation race
      is no longer needed.  Remove it.

      Reported-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: prevent endless growth of anon_vma hierarchy · 7a3ef208
      Konstantin Khlebnikov committed
      A constantly forking task causes unlimited growth of the anon_vma chain.
      Each next child allocates a new level of anon_vmas and links its vmas to
      all previous levels because pages might be inherited from any level.

      This patch adds a heuristic which decides to reuse an existing anon_vma
      instead of forking a new one.  It adds a counter anon_vma->degree which
      counts linked vmas and directly descending anon_vmas, and reuses an
      anon_vma if the counter is lower than two.  As a result each anon_vma has
      either a vma or at least two descending anon_vmas.  In such trees half of
      the nodes are leaves with live vmas, thus the count of anon_vmas is no
      more than twice the count of vmas.

      This heuristic reuses anon_vmas as rarely as possible because each reuse
      adds false aliasing among vmas, and the rmap walker has to scan more ptes
      when it searches for where a page might be mapped.
      
      Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu
      Fixes: 5beb4930 ("mm: change anon_vma linking to fix multi-process server scalability issue")
      [akpm@linux-foundation.org: fix typo, per Rik]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Tested-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Jerome Marchand <jmarchan@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>  [2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 Jan, 2015 1 commit
    • mm: propagate error from stack expansion even for guard page · fee7e49d
      Linus Torvalds committed
      Jay Foad reports that the address sanitizer test (asan) sometimes gets
      confused by a stack pointer that ends up being outside the stack vma
      that is reported by /proc/maps.
      
      This happens due to an interaction between RLIMIT_STACK and the guard
      page: when we do the guard page check, we ignore the potential error
      from the stack expansion, which effectively results in a missing guard
      page, since the expected stack expansion won't have been done.
      
      And since /proc/maps explicitly ignores the guard page (commit
      d7824370: "mm: fix up some user-visible effects of the stack guard
      page"), the stack pointer ends up being outside the reported stack area.
      
      This is the minimal patch: it just propagates the error.  It also
      effectively makes the guard page part of the stack limit, which in turn
      means that the actual real stack is one page less than the stack limit.
      
      Let's see if anybody notices.  We could teach acct_stack_growth() to
      allow an extra page for a grow-up/grow-down stack in the rlimit test,
      but I don't want to add more complexity if it isn't needed.

      Reported-and-tested-by: Jay Foad <jay.foad@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 30 Dec, 2014 1 commit
    • mm: get rid of radix tree gfp mask for pagecache_get_page · 45f87de5
      Michal Hocko committed
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") has added a separate parameter for
      specifying gfp mask for radix tree allocations.
      
      Not only is this less than optimal from the API point of view, because
      it is error prone, it is also currently buggy: grab_cache_page_write_begin
      uses GFP_KERNEL for the radix tree, so if fgp_flags doesn't contain
      FGP_NOFS (mostly controlled by the fs via the AOP_FLAG_NOFS flag) but
      the mapping_gfp_mask has __GFP_FS cleared, then the radix tree
      allocation wouldn't obey the restriction and might recurse into the
      filesystem and cause deadlocks.  Unfortunately this is the case for most
      filesystems, because only ext4 and gfs2 use AOP_FLAG_NOFS.

      Let's simply remove the radix_gfp_mask parameter because the allocation
      context is the same for both the page cache and the radix tree.  Just make
      sure that the radix tree gets only the sane subset of the mask (e.g.  do
      not pass __GFP_WRITE).

      In the long term it is preferable to convert the remaining users of
      AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
      interface even further.

      Reported-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 23 Dec, 2014 1 commit
  5. 19 Dec, 2014 4 commits
    • mm/zsmalloc: adjust order of functions · 66cdef66
      Ganesh Mahendran committed
      Currently the functions in zsmalloc.c are not arranged in a readable and
      reasonable sequence.  As more and more functions are added, we may
      run into the inconvenience below.  For example:
      
      Current functions:
      
          void zs_init()
          {
          }
      
          static void get_maxobj_per_zspage()
          {
          }
      
      Then I want to add a func_1() which is called from zs_init(), and this
      newly added function func_1() will use get_maxobj_per_zspage(), which is
      defined below zs_init().
      
          void func_1()
          {
              get_maxobj_per_zspage()
          }
      
          void zs_init()
          {
              func_1()
          }
      
          static void get_maxobj_per_zspage()
          {
          }
      
      This will cause a compilation problem.  So we must add a declaration:
      
          static void get_maxobj_per_zspage();
      
      before func_1() if we do not put get_maxobj_per_zspage() before
      func_1().
      
      In addition, putting the module_[init|exit] functions at the bottom of the
      file conforms to our habit.

      So, this patch adjusts the function sequence as:
      
          /* helper functions */
          ...
          obj_location_to_handle()
          ...
      
          /* Some exported functions */
          ...
      
          zs_map_object()
          zs_unmap_object()
      
          zs_malloc()
          zs_free()
      
          zs_init()
          zs_exit()

      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory.c:do_shared_fault(): add comment · d82fa87d
      Andrew Morton committed
      Belatedly document the changes in commit f0c6d4d2 ("mm: introduce
      do_shared_fault() and drop do_fault()").
      
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: split cma-reserved in dmesg log · e48322ab
      Pintu Kumar committed
      When the system boots up, the dmesg log shows the memory statistics
      along with the total reserved memory, as below:

        Memory: 458840k/458840k available, 65448k reserved, 0K highmem

      When CMA is enabled, the total reserved memory remains the same, yet
      the CMA memory is not treated as reserved: in /proc/meminfo it is
      counted as part of free memory.  This creates confusion.  This patch
      corrects the problem by subtracting the CMA reserved memory from the
      total reserved memory in the dmesg log and reporting it separately.
      
      Below is the dmesg snapshot from an arm based device with 512MB RAM and
      12MB single CMA region.
      
      Before this change:
        Memory: 458840k/458840k available, 65448k reserved, 0K highmem
      
      After this change:
        Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem

      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: remove unnecessary is_valid_nodemask() · 859f7ef1
      Zhihui Zhang committed
      When nodes is true, nsc->mask2 has already been filtered by nsc->mask1,
      which has already factored in node_states[N_MEMORY].

      Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 18 Dec, 2014 2 commits
  7. 17 Dec, 2014 2 commits
  8. 14 Dec, 2014 23 commits
    • aio: Make it possible to remap aio ring · e4a0d3e7
      Pavel Emelyanov committed
      There are actually two issues this patch addresses. Let me start with
      the one I tried to solve in the beginning.
      
      So, in the checkpoint-restore project (criu) we try to dump tasks'
      state and restore one back exactly as it was. One of the tasks' state
      bits is rings set up with io_setup() call. There's (almost) no problems
      in dumping them, there's a problem restoring them -- if I dump a task
      with aio ring originally mapped at address A, I want to restore one
      back at exactly the same address A. Unfortunately, the io_setup() does
      not allow for that -- it mmaps the ring at whatever place mm finds
      appropriate (it calls do_mmap_pgoff() with zero address and without
      the MAP_FIXED flag).
      
      To make restore possible I'm going to mremap() the freshly created ring
      into the address A (under which it was seen before dump). The problem is
      that the ring's virtual address is passed back to the user-space as the
      context ID and this ID is then used as search key by all the other io_foo()
      calls. Reworking this ID to be just some integer doesn't seem to work, as
      this value is already used by libaio as a pointer using which this library
      accesses memory for aio meta-data.
      
      So, to make restore work we need to make sure that
      
      a) ring is mapped at desired virtual address
      b) kioctx->user_id matches this value
      
      Having said that, the patch makes mremap() on aio region update the
      kioctx's user_id and mmap_base values.
      
      Here appears the 2nd issue I mentioned in the beginning of this mail.
      If (regardless of the C/R dances I do) someone creates an io context
      with io_setup(), then mremap()-s the ring and then destroys the context,
      the kill_ioctx() routine will call munmap() on wrong (old) address.
      This will result in a) aio ring remaining in memory and b) some other
      vma get unexpectedly unmapped.
      
      What do you think?

      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
    • mm/cma: make kmemleak ignore CMA regions · 620951e2
      Thierry Reding committed
      kmemleak will add allocations as objects to a pool.  The memory allocated
      for each object in this pool is periodically searched for pointers to
      other allocated objects.  This only works for memory that is mapped into
      the kernel's virtual address space, which happens not to be the case for
      most CMA regions.
      
      Furthermore, CMA regions are typically used to store data transferred to
      or from a device and therefore don't contain pointers to other objects.
      
      Without this, the kernel crashes on the first execution of the
      scan_gray_list() because it tries to access highmem.  Perhaps a more
      appropriate fix would be to reject any object that can't map to a kernel
      virtual address?
      
      [akpm@linux-foundation.org: add comment]
      [akpm@linux-foundation.org: fix comment, per Catalin]
      [sfr@canb.auug.org.au: include linux/io.h for phys_to_virt()]
      Signed-off-by: Thierry Reding <treding@nvidia.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: fix cpuset check in get_any_partial · dee2f8aa
      Vladimir Davydov committed
      If we fail to allocate from the current node's stock, we look for free
      objects on other nodes before calling the page allocator (see
      get_any_partial).  While checking other nodes we respect cpuset
      constraints by calling cpuset_zone_allowed.  We enforce hardwall check.
      As a result, we will fall back to the page allocator even if there are some
      pages cached on other nodes, but the current cpuset doesn't have them set.
      However, the page allocator uses softwall check for kernel allocations,
      so it may allocate from one of the other nodes in this case.
      
      Therefore we should use softwall cpuset check in get_any_partial to
      conform with the cpuset check in the page allocator.

      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: fix cpuset check in fallback_alloc · 061d7074
      Vladimir Davydov committed
      fallback_alloc is called on kmalloc if the preferred node doesn't have
      free or partial slabs and there are no pages on the node's free list
      (GFP_THISNODE allocations fail).  Before invoking the reclaimer it tries
      to locate a free or partial slab on other allowed nodes' lists.  While
      iterating over the preferred node's zonelist it skips those zones which
      hardwall cpuset check returns false for.  That means that for a task bound
      to a specific node using cpusets fallback_alloc will always ignore free
      slabs on other nodes and go directly to the reclaimer, which, however, may
      allocate from other nodes if cpuset.mem_hardwall is unset (default).  As a
      result, we may get lists of free slabs grow without bounds on other nodes,
      which is bad, because inactive slabs are only evicted by cache_reap at a
      very slow rate and cannot be dropped forcefully.
      
      To reproduce the issue, run a process that will walk over a directory tree
      with lots of files inside a cpuset bound to a node that constantly
      experiences memory pressure.  Look at num_slabs vs active_slabs growth as
      reported by /proc/slabinfo.
      
      To avoid this we should use softwall cpuset check in fallback_alloc.

      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zbud: init user ops only when it is needed · 1dd61aa3
      Heesub Shin committed
      When zbud is initialized through the zpool wrapper, pool->ops which
      points to user-defined operations is always set regardless of whether it
      is specified from the upper layer. This causes zbud_reclaim_page() to
      iterate its loop for evicting pool pages out without any gain.
      
      This patch sets the user-defined ops only when they are needed, so that
      zbud_reclaim_page() can bail out of the reclamation loop earlier if no
      user-defined operations are specified.

      Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Sunae Seo <sunae.seo@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: delete unnecessary check before calling free_percpu() · 442cc432
      Markus Elfring committed
      free_percpu() tests whether its argument is NULL and then returns
      immediately.  Thus the test around the call is not needed.
      
      This issue was detected by using the Coccinelle software.
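      The same pattern exists in userspace C, where the standard guarantees
      that free(NULL) does nothing.  The sketch below is only an analogy:
      it uses free() rather than free_percpu(), and toy_ctx and
      toy_ctx_destroy() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical context owning one optional buffer. */
struct toy_ctx {
    int *buf;   /* may be NULL if the buffer was never allocated */
};

/* Releases the buffer without a NULL test around the call:
 * free(NULL) is defined by the C standard to be a no-op. */
static void toy_ctx_destroy(struct toy_ctx *ctx)
{
    assert(ctx != NULL);
    free(ctx->buf);
    ctx->buf = NULL;
}
```

      Calling toy_ctx_destroy() on a context whose buf is NULL is safe,
      so the caller-side check adds nothing.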

      Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: add __init to some functions in zswap · dd01d7d8
      Mahendran Ganesh committed
      zswap_cpu_init/zswap_comp_exit/zswap_entry_cache_create are only called by
      __init init_zswap().

      Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: allocate exactly size of struct zs_pool · 18136656
      Ganesh Mahendran committed
      In zs_create_pool(), we allocate more memory than sizeof(struct zs_pool):

          ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);

      This patch allocates exactly the size needed.
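      How much such a round-up over-allocates can be worked out with a
      small sketch.  The 4096-byte page size, the TOY_ROUNDUP macro and the
      structure sizes used below are assumptions for illustration, not
      values taken from zsmalloc.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE ((size_t)4096)   /* assumed page size */

/* Round x up to the next multiple of y (y > 0). */
#define TOY_ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

/* Bytes allocated beyond the structure when its size is rounded up
 * to a whole number of pages. */
static size_t toy_wasted_bytes(size_t struct_size)
{
    assert(struct_size > 0);
    return TOY_ROUNDUP(struct_size, TOY_PAGE_SIZE) - struct_size;
}
```

      For a hypothetical 200-byte structure the round-up allocates 3896
      bytes that are never used; allocating sizeof(*pool) directly avoids
      that.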

      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: avoid duplicate assignment of prev_class · df8b5bb9
      Ganesh Mahendran committed
      In zs_create_pool(), prev_class is assigned (ZS_SIZE_CLASSES - 1) times,
      yet prev_class only refers to the previous size_class.  So the
      repeated assignment is unnecessary.
      
      This patch assigns *prev_class* when a new size_class structure is
      allocated and uses prev_class to check whether the first class has been
      allocated.
      
      [akpm@linux-foundation.org: remove now-unused ZS_SIZE_CLASSES]
      Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zsmalloc: support allocating obj with size of ZS_MAX_ALLOC_SIZE · 40f9fb8c
      Mahendran Ganesh committed
      I sent a patch [1] about an unnecessary check in zsmalloc.  Minchan Kim
      then found that zsmalloc does not even support allocating an obj with the
      size of ZS_MAX_ALLOC_SIZE in some situations.
      
      For example:
         On a system with 64KB PAGE_SIZE and 32-bit physical addresses:
         ZS_MIN_ALLOC_SIZE is 32 bytes which is calculated by:
            MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
         ZS_MAX_ALLOC_SIZE is 64KB(in current code, is PAGE_SIZE)
         ZS_SIZE_CLASS_DELTA is 256 bytes
         So, ZS_SIZE_CLASSES = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) /
                                ZS_SIZE_CLASS_DELTA + 1
                             = 256
      
         In zs_create_pool(), the max size obj which can be allocated will be:
            ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA = 32 + 255*256 = 65312
      
         We can see that 65312 < 65536 (ZS_MAX_ALLOC_SIZE). So we can NOT
         allocate objs with size ZS_MAX_ALLOC_SIZE(65536), which we promise
         users we can do.
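      The arithmetic in this example can be checked with a short sketch.
      The toy_zs_params structure and helper names are hypothetical; only
      the three constants come from the example above.

```c
#include <assert.h>

/* Hypothetical container for the three constants in the example. */
struct toy_zs_params {
    unsigned long min_alloc;    /* ZS_MIN_ALLOC_SIZE in the example */
    unsigned long max_alloc;    /* ZS_MAX_ALLOC_SIZE in the example */
    unsigned long class_delta;  /* ZS_SIZE_CLASS_DELTA in the example */
};

/* Number of size classes as computed in the example (integer division). */
static unsigned long toy_nr_classes(const struct toy_zs_params *p)
{
    assert(p->class_delta > 0 && p->max_alloc >= p->min_alloc);
    return (p->max_alloc - p->min_alloc) / p->class_delta + 1;
}

/* Object size served by the last size class. */
static unsigned long toy_largest_class(const struct toy_zs_params *p)
{
    return p->min_alloc + (toy_nr_classes(p) - 1) * p->class_delta;
}
```

      With min_alloc = 32, max_alloc = 65536 and class_delta = 256 this
      gives 256 classes, the largest of which serves 65312 bytes, so a
      65536-byte object has no class to land in.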
      
       [1]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/03835.html
       [2]  http://lkml.iu.edu/hypermail/linux/kernel/1411.2/04534.html
      
      This patch fixes this issue by dynamically calculating zs_size_classes when
      the module is loaded and allocating a buffer of size ZS_MAX_ALLOC_SIZE, so
      that the max obj (of size ZS_MAX_ALLOC_SIZE) can be stored in it.
      
      [akpm@linux-foundation.org: restore ZS_SIZE_CLASSES to fix bisectability]
      Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zsmalloc: correct fragile [kmap|kunmap]_atomic use · af4ee5e9
      Minchan Kim committed
      kunmap_atomic should be passed the virtual address returned by kmap_atomic.
      However, some pieces of code in zsmalloc pass kunmap_atomic a modified
      address, not the one returned by kmap_atomic.

      This happens to work because zsmalloc modifies the address only within the
      PAGE_SIZE boundary, which the current kmap_atomic implementation tolerates.
      But it is fragile with respect to potential changes of kmap_atomic, so let's
      correct it.
      
      I got a subtle bug when I implemented a new feature of zsmalloc
      (compaction) due to a link's mishandling (the link was over page
      boundary).  Although it was totally my mistake, it took a while to find
      the cause because an unpredictable kmapped address was unmapped causing an
      almost random crash.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af4ee5e9
    • S
      zsmalloc: fix zs_init cpu notifier error handling · b1b00a5b
      Sergey Senozhatsky committed
      Mahendran Ganesh reported that zpool-enabled zsmalloc should not call
      zpool_unregister_driver() from zs_init() if cpu notifier registration has
      failed, because error handling is performed before we register the driver
      via zpool_register_driver() call.
      
      Factor out cpu notifier registration and unregistration code and fix
      zs_init() error handling.
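      
      A rough sketch of the intended ordering, using hypothetical demo_* stand-ins
      rather than the real notifier and zpool calls: when notifier registration
      fails, only the notifier is unwound, because the driver has not been
      registered yet at that point.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the cpu notifier and zpool driver calls. */
static bool demo_fail_notifier;         /* set to simulate a failure */
static bool demo_notifier_registered;
static bool demo_driver_registered;
static int  demo_driver_unregister_calls;

static int demo_register_cpu_notifier(void)
{
    if (demo_fail_notifier)
        return -1;
    demo_notifier_registered = true;
    return 0;
}

static void demo_unregister_cpu_notifier(void)
{
    demo_notifier_registered = false;
}

static void demo_register_driver(void)
{
    demo_driver_registered = true;
}

static void demo_unregister_driver(void)
{
    demo_driver_registered = false;
    demo_driver_unregister_calls++;
}

static int demo_init(void)
{
    int ret = demo_register_cpu_notifier();

    if (ret) {
        /* Undo only what was done so far; the driver is not registered. */
        demo_unregister_cpu_notifier();
        return ret;
    }
    demo_register_driver();
    return 0;
}

static void demo_exit(void)
{
    /* Tear down in the reverse order of demo_init(). */
    demo_unregister_driver();
    demo_unregister_cpu_notifier();
}
```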
      
      link: http://lkml.iu.edu//hypermail/linux/kernel/1411.1/04156.html
      [akpm@linux-foundation.org: squash bogus gcc warning]
      [akpm@linux-foundation.org: use __init and __exit]
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reported-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b1b00a5b
    • J
      zsmalloc: merge size_class to reduce fragmentation · 9eec4cd5
      Joonsoo Kim committed
      zsmalloc has many size_classes to reduce fragmentation.  They are spaced in
      units of 16 bytes (for example 16, 32, 48, etc.) if PAGE_SIZE is 4096.
      zsmalloc also has the constraint that each zspage spans at most 4 pages.
      
      In this situation, we can see an interesting aspect.  Consider the
      size_classes for 1488, 1472, ..., 1376.  To prevent external fragmentation,
      they use 4 pages per zspage, so each of them can contain at most 11
      objects.
      
      16384 (4096 * 4) = 1488 * 11 + remains
      16384 (4096 * 4) = 1472 * 11 + remains
      16384 (4096 * 4) = ...
      16384 (4096 * 4) = 1376 * 11 + remains
      
      This means they have the same characteristics, and there is no need to
      distinguish between them.  If we use one size_class for all of them, we can
      reduce fragmentation and save some memory.  Both the 1488 and 1472 sized
      classes can only fit 11 objects into 4 pages, and an object that's 1472
      bytes can fit into a slot that's 1488 bytes, so merging these classes to
      always use objects that are 1488 bytes reduces the total number of size
      classes.  Reducing the total number of size classes in turn reduces
      overall fragmentation, because a wider range of compressed pages can fit
      into a single size class, leaving fewer unused objects in each size class.
      
      For this purpose, this patch implements size_class merging.  If a
      size_class has the same pages_per_zspage and the same number of objects per
      zspage as the previous size_class, we don't create a new size_class.
      Instead, we reuse the previous size_class with the same characteristics.
      This way, the example sizes above (1488, 1472, ..., 1376) use just one
      size_class, so we get much better memory utilization.
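      
      The merge rule can be sketched in user space as below.  This is an
      illustration under stated assumptions, not the patch itself: the helper
      names are hypothetical, PAGE_SIZE is taken as 4096, and the number of pages
      per zspage is assumed to be the count (1 to 4) that leaves the smallest
      share of the zspage unused, compared in whole percent.

```c
#include <stdbool.h>

#define DEMO_PAGE_SIZE            4096
#define DEMO_MAX_PAGES_PER_ZSPAGE 4

/* Pick the page count whose zspage has the highest used percentage. */
static int demo_pages_per_zspage(int class_size)
{
    int best = 1, best_used_pct = 0;

    for (int i = 1; i <= DEMO_MAX_PAGES_PER_ZSPAGE; i++) {
        int zspage_size = i * DEMO_PAGE_SIZE;
        int waste = zspage_size % class_size;
        int used_pct = (zspage_size - waste) * 100 / zspage_size;

        if (used_pct > best_used_pct) {
            best_used_pct = used_pct;
            best = i;
        }
    }
    return best;
}

/* Number of whole objects of class_size that fit in one zspage. */
static int demo_objs_per_zspage(int class_size)
{
    return demo_pages_per_zspage(class_size) * DEMO_PAGE_SIZE / class_size;
}

/* Two classes can share one size_class if both characteristics match. */
static bool demo_can_merge(int size_a, int size_b)
{
    return demo_pages_per_zspage(size_a) == demo_pages_per_zspage(size_b) &&
           demo_objs_per_zspage(size_a) == demo_objs_per_zspage(size_b);
}
```

      Under these assumptions, demo_can_merge(1488, 1376) holds (4 pages and 11
      objects each), while 1504 does not merge with 1488 because it is better
      served by a 3-page zspage holding 8 objects.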
      
      Below is the result of my simple test.
      
      TEST ENV: EXT4 on zram, mounted with the discard option
      WORKLOAD: untar the kernel source code, then remove directories in
      descending order of size (drivers arch fs sound include net Documentation
      firmware kernel tools)
      
      Each line represents orig_data_size, compr_data_size, mem_used_total,
      fragmentation overhead (mem_used - compr_data_size) and overhead ratio
      (overhead to compr_data_size), respectively, after untar and remove
      operation is executed.
      
      * untar-nomerge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      525.88MB 199.16MB 210.23MB  11.08MB 5.56%
      288.32MB  97.43MB 105.63MB   8.20MB 8.41%
      177.32MB  61.12MB  69.40MB   8.28MB 13.55%
      146.47MB  47.32MB  56.10MB   8.78MB 18.55%
      124.16MB  38.85MB  48.41MB   9.55MB 24.58%
      103.93MB  31.68MB  40.93MB   9.25MB 29.21%
       84.34MB  22.86MB  32.72MB   9.86MB 43.13%
       66.87MB  14.83MB  23.83MB   9.00MB 60.70%
       60.67MB  11.11MB  18.60MB   7.49MB 67.48%
       55.86MB   8.83MB  16.61MB   7.77MB 88.03%
       53.32MB   8.01MB  15.32MB   7.31MB 91.24%
      
      * untar-merge.out
      
      orig_size compr_size used_size overhead overhead_ratio
      526.23MB 199.18MB 209.81MB  10.64MB 5.34%
      288.68MB  97.45MB 104.08MB   6.63MB 6.80%
      177.68MB  61.14MB  66.93MB   5.79MB 9.47%
      146.83MB  47.34MB  52.79MB   5.45MB 11.51%
      124.52MB  38.87MB  44.30MB   5.43MB 13.96%
      104.29MB  31.70MB  36.83MB   5.13MB 16.19%
       84.70MB  22.88MB  27.92MB   5.04MB 22.04%
       67.11MB  14.83MB  19.26MB   4.43MB 29.86%
       60.82MB  11.10MB  14.90MB   3.79MB 34.17%
       55.90MB   8.82MB  12.61MB   3.79MB 42.97%
       53.32MB   8.01MB  11.73MB   3.73MB 46.53%
      
      As the results above show, the merged version has better utilization
      (overhead ratio, 5th column) and uses less memory (mem_used_total, 3rd
      column).
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: <juno.choi@lge.com>
      Cc: "seungho1.park" <seungho1.park@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9eec4cd5
    • R
      mm/memcontrol.c: remove unused mem_cgroup_lru_names_not_uptodate() · 70bc068c
      Rickard Strandqvist committed
      Remove unused mem_cgroup_lru_names_not_uptodate() and move BUILD_BUG_ON()
      to the beginning of memcg_stat_show().
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70bc068c
    • V
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov committed
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound like a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
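      
      The reference-counting idea can be illustrated with a small single-threaded
      model.  All demo_* names are hypothetical and the counter is a plain int
      rather than a per-cpu counter; the sketch only shows that a reference taken
      at the start of an allocation keeps the per-memcg caches alive until it is
      dropped, even if the cgroup goes offline in between.

```c
#include <stdbool.h>

/* Hypothetical model of a memory cgroup and its per-memcg caches. */
struct demo_memcg {
    int  refs;               /* starts at 1: the cgroup's own reference */
    bool offline;
    bool caches_destroyed;   /* set when the last reference is dropped */
};

static void demo_memcg_get(struct demo_memcg *m)
{
    m->refs++;
}

static void demo_memcg_put(struct demo_memcg *m)
{
    /* Caches go away at "free" time, once nobody holds a reference. */
    if (--m->refs == 0)
        m->caches_destroyed = true;
}

static void demo_memcg_offline(struct demo_memcg *m)
{
    m->offline = true;
    demo_memcg_put(m);       /* drop the cgroup's own reference */
}

/*
 * Model of an allocation that may race with cgroup destruction.
 * Returns whether the cache was still alive when the page was added.
 */
static bool demo_kmem_cache_alloc(struct demo_memcg *m, bool race_with_destroy)
{
    bool cache_alive;

    demo_memcg_get(m);                 /* taken at the beginning */
    if (race_with_destroy)
        demo_memcg_offline(m);         /* the other CPU tears the cgroup down */
    cache_alive = !m->caches_destroyed; /* "add page to @mc" happens here */
    demo_memcg_put(m);                 /* dropped at the end */
    return cache_alive;
}
```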
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8135be5a
    • M
      mm/memcontrol.c: fix defined but not used compiler warning · ae6e71d3
      Michele Curti committed
      test_mem_cgroup_node_reclaimable() is used only when MAX_NUMNODES > 1, so
      move it inside the corresponding preprocessor #if block.
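      
      The general shape of such a fix, sketched with hypothetical demo_* names
      and a demo macro instead of the real memcontrol.c code: a static helper
      that is only called under a preprocessor condition is defined under the
      same condition, so single-node builds never see an unused definition.

```c
/* Hypothetical stand-in for MAX_NUMNODES; 1 models a single-node build. */
#ifndef DEMO_MAX_NUMNODES
#define DEMO_MAX_NUMNODES 1
#endif

#if DEMO_MAX_NUMNODES > 1
/* Only needed on multi-node builds, so it lives inside the #if. */
static int demo_node_reclaimable(int nid)
{
    (void)nid;
    return 1;   /* the demo treats every node as reclaimable */
}

static int demo_pick_node(void)
{
    for (int nid = 0; nid < DEMO_MAX_NUMNODES; nid++)
        if (demo_node_reclaimable(nid))
            return nid;
    return 0;
}
#else
/* Single-node build: there is only node 0 and no helper to warn about. */
static int demo_pick_node(void)
{
    return 0;
}
#endif
```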
      
      [akpm@linux-foundation.org: clean up layout]
      Signed-off-by: Michele Curti <michele.curti@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae6e71d3
    • M
      mm: fadvise: document the fadvise(FADV_DONTNEED) behaviour for partial pages · 441c228f
      Mel Gorman committed
      A random seek IO benchmark appeared to regress because of a change to
      readahead but the real problem was the benchmark.  To ensure the IO
      request accessed disk, it used fadvise(FADV_DONTNEED) on a block boundary
      (512K) but the hint is ignored by the kernel.  This is correct but not
      necessarily obvious behaviour.  As much as I dislike comment patches, the
      explanation for this behaviour predates current git history.  Clarify why
      it behaves like this in case someone "fixes" fadvise or readahead for the
      wrong reasons.
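      
      The "partial pages are left alone" rule can be illustrated with a small
      calculation.  This is a simplified sketch of the intended behaviour, not
      the kernel's code; the helper name and the 4096-byte page size are
      assumptions.  It counts how many pages are fully covered by a byte range,
      which is the set of pages such a hint could drop.

```c
#define DEMO_PAGE_SIZE 4096ULL

/* Number of pages that lie entirely inside [offset, offset + len). */
static unsigned long long demo_full_pages(unsigned long long offset,
                                          unsigned long long len)
{
    /* First page that starts at or after offset. */
    unsigned long long first = (offset + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
    /* One past the last page that ends at or before offset + len. */
    unsigned long long end = (offset + len) / DEMO_PAGE_SIZE;

    return end > first ? end - first : 0;
}
```

      A request smaller than a page, such as demo_full_pages(512, 512), covers no
      full page, so nothing would be discarded.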
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      441c228f
    • D
      mm/vmalloc.c: fix memory ordering bug · 7e5b528b
      Dmitry Vyukov committed
      Read memory barriers must follow the read operations.
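      
      As a general illustration of that rule (not the vmalloc.c code itself), the
      sketch below uses C11 fences as a stand-in for kernel barriers.  The struct
      and function names are hypothetical.  The reader first loads the flag, then
      issues the barrier, and only then reads the data the flag guards.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical record published by one thread and read by another. */
struct demo_area {
    atomic_int ready;   /* flag that guards payload */
    int payload;
};

static void demo_publish(struct demo_area *a, int value)
{
    a->payload = value;
    atomic_thread_fence(memory_order_release);  /* write barrier before flag */
    atomic_store_explicit(&a->ready, 1, memory_order_relaxed);
}

static bool demo_read(struct demo_area *a, int *out)
{
    /* Read the flag first ... */
    if (!atomic_load_explicit(&a->ready, memory_order_relaxed))
        return false;
    /* ... then the read barrier, placed after the flag read ... */
    atomic_thread_fence(memory_order_acquire);
    /* ... and only then the data that the flag protects. */
    *out = a->payload;
    return true;
}
```

      Placing the fence before the flag load would not order the flag read
      against the payload read, which is the kind of mistake the one-line
      description points at.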
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e5b528b
    • O
      oom: kill the insufficient and no longer needed PT_TRACE_EXIT check · 6a2d5679
      Oleg Nesterov committed
      After the previous patch we can remove the PT_TRACE_EXIT check in
      oom_scan_process_thread().  It was added to handle the case when the
      coredumping was "frozen" by ptrace, but it doesn't really work.  If
      nothing else, we would need to check all threads which could share the
      same ->mm to make it more or less correct.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a2d5679
    • O
      oom: don't assume that a coredumping thread will exit soon · d003f371
      Oleg Nesterov committed
      oom_kill.c assumes that PF_EXITING task should exit and free the memory
      soon.  This is wrong in many ways and one important case is the coredump.
      A task can sleep in exit_mm() "forever" while the coredumping sub-thread
      can need more memory.
      
      Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account,
      we add the new trivial helper for that.
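      
      A helper of this kind might look roughly like the sketch below.  The
      structs, the flag values and the function name are hypothetical
      user-space stand-ins, not the kernel definitions; the sketch only shows
      the combined check described above.

```c
#include <stdbool.h>

/* Hypothetical flag values, chosen for the demo only. */
#define DEMO_PF_EXITING            0x4u
#define DEMO_SIGNAL_GROUP_COREDUMP 0x8u

struct demo_signal {
    unsigned int flags;
};

struct demo_task {
    unsigned int flags;
    struct demo_signal *signal;
};

/*
 * An exiting task is only expected to release memory soon if its
 * thread group is not in the middle of a coredump.
 */
static bool demo_task_will_free_mem(const struct demo_task *t)
{
    return (t->flags & DEMO_PF_EXITING) &&
           !(t->signal->flags & DEMO_SIGNAL_GROUP_COREDUMP);
}
```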
      
      Note: this is only the first step, this patch doesn't try to solve other
      problems.  The SIGNAL_GROUP_COREDUMP check is obviously racy, a task can
      participate in coredump after it was already observed in PF_EXITING state,
      so TIF_MEMDIE (which also blocks oom-killer) still can be wrongly set.
      fatal_signal_pending() can be true because of SIGNAL_GROUP_COREDUMP so
      out_of_memory() and mem_cgroup_out_of_memory() shouldn't blindly trust it.
      And even the name/usage of the new helper is confusing, an exiting thread
      can only free its ->mm if it is the only/last task in thread group.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d003f371
    • Z
      mm: remove the highmem zones' memmap in the highmem zone · ba914f48
      Zhong Hongbo committed
      Since commit 01cefaef ("mm: provide more accurate estimation of pages
      occupied by memmap"), the pages for the highmem zones' memmap are allocated
      from lowmem.  So there is no need to reserve memmap pages in the highmem
      zone.
      
      On an ARM platform with 2G of DDR3, before this patch:
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 528 pages used for memmap
        HighMem zone: 67584 pages, LIFO batch:15
      
      After this patch:
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 67584 pages, LIFO batch:15
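      
      The memmap figures in the log can be reproduced with a short calculation.
      The helper name is hypothetical, and the 32-byte descriptor size and
      4096-byte page size are assumptions inferred from the numbers above rather
      than stated in the commit.

```c
#define DEMO_PAGE_SIZE        4096UL
#define DEMO_STRUCT_PAGE_SIZE 32UL   /* assumption inferred from the log */

/* Pages needed to hold the per-page descriptors for a zone. */
static unsigned long demo_memmap_pages(unsigned long zone_pages)
{
    unsigned long bytes = zone_pages * DEMO_STRUCT_PAGE_SIZE;

    return (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;   /* round up */
}
```

      With these assumptions, 67584 HighMem pages need 528 memmap pages and
      456704 DMA pages need 3568, matching the log.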
      Signed-off-by: Hongbo Zhong <hongbo.zhong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba914f48
    • H
      mm: unmapped page migration avoid unmap+remap overhead · 2ebba6b7
      Hugh Dickins committed
      Page migration's __unmap_and_move(), and rmap's try_to_unmap(), were
      created for use on pages almost certainly mapped into userspace.  But
      nowadays compaction often applies them to unmapped page cache pages: which
      may exacerbate contention on i_mmap_rwsem quite unnecessarily, since
      try_to_unmap_file() makes no preliminary page_mapped() check.
      
      Now check page_mapped() in __unmap_and_move(); and avoid repeating the
      same overhead in rmap_walk_file() - don't remove_migration_ptes() when we
      never inserted any.
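      
      The control flow described here can be modelled in user space as follows.
      The demo_* functions and counters are hypothetical stand-ins for
      page_mapped(), try_to_unmap() and remove_migration_ptes(); the sketch only
      shows that both the unmap and the later remap are skipped for a page that
      was never mapped.

```c
#include <stdbool.h>

struct demo_page {
    int mapcount;
};

static int demo_unmap_calls;
static int demo_remap_calls;

static bool demo_page_mapped(const struct demo_page *p)
{
    return p->mapcount > 0;
}

static void demo_try_to_unmap(struct demo_page *p)
{
    (void)p;
    demo_unmap_calls++;
}

static void demo_remove_migration_ptes(struct demo_page *p)
{
    (void)p;
    demo_remap_calls++;
}

static void demo_unmap_and_move(struct demo_page *p)
{
    bool page_was_mapped = false;

    /* Unmapped page cache pages skip the rmap walk entirely. */
    if (demo_page_mapped(p)) {
        demo_try_to_unmap(p);
        page_was_mapped = true;
    }

    /* ... the page contents would be moved to the new page here ... */

    /* No migration entries were inserted, so there is nothing to remove. */
    if (page_was_mapped)
        demo_remove_migration_ptes(p);
}
```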
      
      (The PageAnon(page) comment blocks now look even sillier than before, but
      clean that up on some other occasion.  And note in passing that
      try_to_unmap_one() does not use a migration entry when PageSwapCache, so
      remove_migration_ptes() will then not update that swap entry to newpage
      pte: not a big deal, but something else to clean up later.)
      
      Davidlohr remarked in "mm,fs: introduce helpers around the i_mmap_mutex"
      conversion to i_mmap_rwsem, that "The biggest winner of these changes is
      migration": a part of the reason might be all of that unnecessary taking
      of i_mmap_mutex in page migration; and it's rather a shame that I didn't
      get around to sending this patch in before his - this one is much less
      useful after Davidlohr's conversion to rwsem, but still good.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ebba6b7
    • J
      mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner committed
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
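      
      The "once per node, using the class zone as the pivot" rule can be sketched
      with a toy model.  The demo_* names are hypothetical and the LRU and slab
      work is reduced to counters; this is an illustration of the control flow
      only, not the vmscan.c implementation.

```c
#include <stdbool.h>

struct demo_zone {
    int idx;    /* position of the zone within its node */
};

static int demo_lru_scans;
static int demo_slab_shrinks;

/* Every zone gets its LRUs scanned; only the class zone shrinks slab. */
static void demo_shrink_zone(const struct demo_zone *zone, bool is_classzone)
{
    (void)zone;
    demo_lru_scans++;
    if (is_classzone)
        demo_slab_shrinks++;
}

/* Walk all zones of one node for an allocation with the given class zone. */
static void demo_shrink_node(const struct demo_zone *zones, int nr_zones,
                             int classzone_idx)
{
    for (int i = 0; i < nr_zones; i++)
        demo_shrink_zone(&zones[i], zones[i].idx == classzone_idx);
}
```

      For a node with three zones, one pass scans three sets of LRUs but invokes
      the shrinkers a single time.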
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b4f7799