1. 05 Jun 2014 (32 commits)
    • memcg: get rid of memcg_create_cache_name · 073ee1c6
      Authored by Vladimir Davydov
      Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
      order to just create the name of a per memcg cache, let's allocate it in
      place.  We only need to pass the memcg name to kmem_cache_create_memcg for
      that - everything else can be done in slab_common.c.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      073ee1c6
    • mm: update comment for DEFAULT_MAX_MAP_COUNT · 3fb1c8dc
      Authored by Kirill A. Shutemov
      With ELF extended numbering, the 16-bit bound is no longer a hard limit.
      
      [akpm@linux-foundation.org: fix typo]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fb1c8dc
    • mm/rmap.c: make page_referenced_one() and try_to_unmap_one() static · ac769501
      Authored by Kirill A. Shutemov
      KSM was converted to use rmap_walk() and now nobody uses these functions
      outside mm/rmap.c.
      
      Let's convert them back to static.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac769501
    • mm: shrinker: add nid to tracepoint output · df9024a8
      Authored by Dave Hansen
      Now that we are doing NUMA-aware shrinking, and can have shrinkers
      running in parallel, or working on individual nodes, we should also
      include the node ID in the tracepoint output.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df9024a8
    • mm: shrinker trace points: fix negatives · 7fe70475
      Authored by Dave Hansen
      I was looking at a trace of the slab shrinkers (attachment in this comment):
      
      	https://bugs.freedesktop.org/show_bug.cgi?id=72742#c67
      
      and noticed that "total_scan" can go negative in some cases.  We
      used to dump out the "total_scan" variable directly, but some of
      the shrinker modifications along the way changed that.
      
      This patch just dumps it out directly, again.  It doesn't make
      any sense to derive it from new_nr and nr any more since there
      are now other shrinkers that can be running in parallel and
      mucking with those values.
      
      Here's an example of the negative numbers in the output:
      
      >          kswapd0-840   [000]   160.869398: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 10 new scan count 39 total_scan 29 last shrinker return val 256
      >          kswapd0-840   [000]   160.869618: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 39 new scan count 102 total_scan 63 last shrinker return val 256
      >          kswapd0-840   [000]   160.870031: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 102 new scan count 47 total_scan -55 last shrinker return val 768
      >          kswapd0-840   [000]   160.870464: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 47 new scan count 45 total_scan -2 last shrinker return val 768
      >          kswapd0-840   [000]   163.384144: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 45 new scan count 56 total_scan 11 last shrinker return val 0
      >          kswapd0-840   [000]   163.384297: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 56 new scan count 15 total_scan -41 last shrinker return val 256
      >          kswapd0-840   [000]   163.384414: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 15 new scan count 117 total_scan 102 last shrinker return val 0
      >          kswapd0-840   [000]   163.384657: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 117 new scan count 36 total_scan -81 last shrinker return val 512
      >          kswapd0-840   [000]   163.384880: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 36 new scan count 111 total_scan 75 last shrinker return val 256
      >          kswapd0-840   [000]   163.385256: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 111 new scan count 34 total_scan -77 last shrinker return val 768
      >          kswapd0-840   [000]   163.385598: mm_shrink_slab_end:   i915_gem_inactive_scan+0x0 0xffff8800037cbc68: unused scan count 34 new scan count 122 total_scan 88 last shrinker return val 512
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7fe70475
    • include/linux/bootmem.h: cleanup the comment for BOOTMEM_ flags · 1754e44e
      Authored by Wang Sheng-Hui
      Use BOOTMEM_DEFAULT instead of 0 in the comment.
      Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1754e44e
    • mm: introdule compound_head_by_tail() · d2ee40ea
      Authored by Jianyu Zhan
      Currently, in put_compound_page(), we have
      
      ======
      if (likely(!PageTail(page))) {                  <------  (1)
              if (put_page_testzero(page)) {
                       /*
                 * By the time all refcounts have been released
                 * split_huge_page cannot run anymore from under us.
                 */
                       if (PageHead(page))
                               __put_compound_page(page);
                       else
                               __put_single_page(page);
               }
               return;
      }
      
      /* __split_huge_page_refcount can run under us */
      page_head = compound_head(page);        <------------ (2)
      ======
      
      If the check at (1) fails, the page is *likely* a tail page.

      Then at (2), since compound_head(page) is inlined, it expands to:
      
      ======
      static inline struct page *compound_head(struct page *page)
      {
                if (unlikely(PageTail(page))) {           <----------- (3)
                    struct page *head = page->first_page;
      
                      smp_rmb();
                      if (likely(PageTail(page)))
                              return head;
              }
              return page;
      }
      ======
      
      Here, the unlikely() at (3) is a counterproductive hint, because in
      this path the page is *likely* a tail page.  So the check at (3) is a
      poor fit for this case, which is why a dedicated helper is introduced.

      This patch introduces compound_head_by_tail(), which handles a probable
      tail page (though it could still be split by a racing thread), and
      makes compound_head() a wrapper around it.
      
      This patch has no functional change, and it reduces the object
      size slightly:
         text    data     bss     dec     hex  filename
        11003    1328      16   12347    303b  mm/swap.o.orig
        10971    1328      16   12315    301b  mm/swap.o.patched
      
      I ran "perf top -e branch-miss" to observe branch misses in this case.
      As Michael points out, this is a slow path, so the case occurs only
      rarely.  But grepping the code base shows there are other call sites
      that could benefit from this helper.  Given that it grows the source by
      only 5 lines while reducing the object size, I still believe this
      helper deserves to exist.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2ee40ea
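      A rough sketch of what the new helper and the resulting wrapper could
      look like (simplified and hedged; not the verbatim kernel code):

      /* Assumes the caller has already observed PageTail(page) == true. */
      static inline struct page *compound_head_by_tail(struct page *tail)
      {
              struct page *head = tail->first_page;

              /*
               * Re-check after reading first_page: a racing
               * __split_huge_page_refcount() may have turned the tail back
               * into a standalone page.
               */
              smp_rmb();
              if (likely(PageTail(tail)))
                      return head;
              return tail;
      }

      static inline struct page *compound_head(struct page *page)
      {
              if (unlikely(PageTail(page)))
                      return compound_head_by_tail(page);
              return page;
      }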
    • mm: constify nmask argument to set_mempolicy() · 23c8902d
      Authored by Rasmus Villemoes
      The nmask argument to set_mempolicy() is const according to the user-space
      header numaif.h, and since the kernel does indeed not modify it, it might
      as well be declared const in the kernel.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23c8902d
    • mm: constify nmask argument to mbind() · f7f28ca9
      Authored by Rasmus Villemoes
      The nmask argument to mbind() is const according to the userspace header
      numaif.h, and since the kernel does indeed not modify it, it might as well
      be declared const in the kernel.
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f7f28ca9
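      For reference, a sketch of the constified declarations as they would
      appear in include/linux/syscalls.h (assumed from the two descriptions
      above, simplified):

      asmlinkage long sys_set_mempolicy(int mode,
                                        const unsigned long __user *nmask,
                                        unsigned long maxnode);
      asmlinkage long sys_mbind(unsigned long start, unsigned long len,
                                unsigned long mode,
                                const unsigned long __user *nmask,
                                unsigned long maxnode, unsigned flags);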
    • fs/block_dev.c: add bdev_read_page() and bdev_write_page() · 47a191fd
      Authored by Matthew Wilcox
      A block device driver may choose to provide a rw_page operation.  These
      will be called when the filesystem is attempting to do page-sized I/O
      to page cache pages (i.e. not for direct I/O).  This does preclude I/Os
      that are larger than page size, so this may only be a performance gain
      for some devices.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Tested-by: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47a191fd
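      A hedged usage sketch: a filesystem's ->readpage could try the new
      rw_page path first and fall back to the regular bio-based path when the
      driver does not implement ->rw_page.  example_readpage() and
      example_get_block() are hypothetical names, and the -EOPNOTSUPP
      convention is assumed from the description above.

      static int example_readpage(struct file *file, struct page *page)
      {
              struct block_device *bdev = page->mapping->host->i_sb->s_bdev;
              sector_t sector = (sector_t)page->index << (PAGE_CACHE_SHIFT - 9);
              int ret;

              /* Fast path: let the driver's ->rw_page handle the single page. */
              ret = bdev_read_page(bdev, sector, page);
              if (ret != -EOPNOTSUPP)
                      return ret;

              /* No ->rw_page: fall back to the usual bio-based path. */
              return mpage_readpage(page, example_get_block);
      }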
    • fs/mpage.c: factor page_endio() out of mpage_end_io() · 57d99845
      Authored by Matthew Wilcox
      page_endio() takes care of updating all the appropriate page flags once
      I/O has finished to a page.  Switch to using mapping_set_error() instead
      of setting AS_EIO directly; this will handle thin-provisioned devices
      correctly.
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57d99845
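      A simplified sketch of the behaviour described above (assumed shape,
      not the verbatim implementation):

      void page_endio(struct page *page, int rw, int err)
      {
              if (rw == READ) {
                      if (!err) {
                              SetPageUptodate(page);
                      } else {
                              ClearPageUptodate(page);
                              SetPageError(page);
                      }
                      unlock_page(page);
              } else {        /* WRITE */
                      if (err) {
                              SetPageError(page);
                              /* handles thin-provisioned devices correctly */
                              mapping_set_error(page->mapping, err);
                      }
                      end_page_writeback(page);
              }
      }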
    • fs/buffer.c: remove block_write_full_page_endio() · 1b938c08
      Authored by Matthew Wilcox
      The last in-tree caller of block_write_full_page_endio() was removed in
      January 2013.  It's time to remove the EXPORT_SYMBOL, which leaves
      block_write_full_page() as the only caller of
      block_write_full_page_endio(), so inline block_write_full_page_endio()
      into block_write_full_page().
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dheeraj Reddy <dheeraj.reddy@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b938c08
    • memcg, slab: simplify synchronization scheme · bd673145
      Authored by Vladimir Davydov
      At present, we have the following mutexes protecting data related to per
      memcg kmem caches:
      
       - slab_mutex.  This one is held during the whole kmem cache creation
         and destruction paths.  We also take it when updating per root cache
         memcg_caches arrays (see memcg_update_all_caches).  As a result, taking
         it guarantees there will be no changes to any kmem cache (including per
         memcg).  Why do we need something else then?  The point is it is
         private to slab implementation and has some internal dependencies with
         other mutexes (get_online_cpus).  So we just don't want to rely upon it
         and prefer to introduce additional mutexes instead.
      
       - activate_kmem_mutex.  Initially it was added to synchronize
         initializing kmem limit (memcg_activate_kmem).  However, since we can
         grow per root cache memcg_caches arrays only on kmem limit
         initialization (see memcg_update_all_caches), we also employ it to
         protect against memcg_caches arrays relocation (e.g.  see
         __kmem_cache_destroy_memcg_children).
      
       - We have a convention not to take slab_mutex in memcontrol.c, but we
         want to walk over per memcg memcg_slab_caches lists there (e.g.  for
         destroying all memcg caches on offline).  So we have per memcg
         slab_caches_mutex's protecting those lists.
      
      The mutexes are taken in the following order:
      
         activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
      
      Such a synchronization scheme has a number of flaws, for instance:
      
       - We can't call kmem_cache_{destroy,shrink} while walking over a
         memcg::memcg_slab_caches list due to locking order.  As a result, in
         mem_cgroup_destroy_all_caches we schedule the
         memcg_cache_params::destroy work shrinking and destroying the cache.
      
       - We don't have a mutex to synchronize per memcg caches destruction
         between memcg offline (mem_cgroup_destroy_all_caches) and root cache
         destruction (__kmem_cache_destroy_memcg_children).  Currently we just
         don't bother about it.
      
      This patch simplifies it by substituting per memcg slab_caches_mutex's
      with the global memcg_slab_mutex.  It will be held whenever a new per
      memcg cache is created or destroyed, so it protects per root cache
      memcg_caches arrays and per memcg memcg_slab_caches lists.  The locking
      order is as follows:
      
         activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
      
      This allows us to call kmem_cache_{create,shrink,destroy} under the
      memcg_slab_mutex.  As a result, we don't need memcg_cache_params::destroy
      work any more - we can simply destroy caches while iterating over a per
      memcg slab caches list.
      
      Also using the global mutex simplifies synchronization between concurrent
      per memcg caches creation/destruction, e.g.  mem_cgroup_destroy_all_caches
      vs __kmem_cache_destroy_memcg_children.
      
      The downside of this is that we substitute per-memcg slab_caches_mutex's
      with a hammer-like global mutex, but since we already take either the
      slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
      shouldn't hurt concurrency a lot.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd673145
    • memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Authored by Vladimir Davydov
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c67a8a68
    • memcg, slab: do not schedule cache destruction when last page goes away · 1e32e77f
      Authored by Vladimir Davydov
      This patchset is a part of preparations for kmemcg re-parenting.  It
      targets at simplifying kmemcg work-flows and synchronization.
      
      First, it removes async per memcg cache destruction (see patches 1, 2).
      Now caches are only destroyed on memcg offline.  That means the caches
      that are not empty on memcg offline will be leaked.  However, they are
      already leaked, because memcg_cache_params::nr_pages normally never drops
      to 0, so the destruction work is never scheduled unless kmem_cache_shrink
      is called explicitly.  In the future I'm planning to reap such dead caches
      on vmpressure or periodically.
      
      Second, it substitutes per memcg slab_caches_mutex's with the global
      memcg_slab_mutex, which should be taken during the whole per memcg cache
      creation/destruction path before the slab_mutex (see patch 3).  This
      greatly simplifies synchronization among various per memcg cache
      creation/destruction paths.
      
      I'm still not quite sure about the end picture, in particular I don't know
      whether we should reap dead memcgs' kmem caches periodically or try to
      merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
      more details), but whichever way we choose, this set looks like a
      reasonable change to me, because it greatly simplifies kmemcg work-flows
      and eases further development.
      
      This patch (of 3):
      
      After a memcg is offlined, we mark its kmem caches that cannot be deleted
      right now due to pending objects as dead by setting the
      memcg_cache_params::dead flag, so that memcg_release_pages will schedule
      cache destruction (memcg_cache_params::destroy) as soon as the last slab
      of the cache is freed (memcg_cache_params::nr_pages drops to zero).
      
      I guess the idea was to destroy the caches as soon as possible, i.e.
      immediately after freeing the last object.  However, it just doesn't work
      that way, because kmem caches always preserve some pages for the sake of
      performance, so that nr_pages never gets to zero unless the cache is
      shrunk explicitly using kmem_cache_shrink.  Of course, we could account
      the total number of objects on the cache or check if all the slabs
      allocated for the cache are empty on kmem_cache_free and schedule
      destruction if so, but that would be too costly.
      
      Thus we have a piece of code that works only when we explicitly call
      kmem_cache_shrink, but complicates the whole picture a lot.  Moreover,
      it's racy in fact.  For instance, kmem_cache_shrink may free the last slab
      and thus schedule cache destruction before it finishes checking that the
      cache is empty, which can lead to use-after-free.
      
      So I propose to remove this async cache destruction from
      memcg_release_pages, and check if the cache is empty explicitly after
      calling kmem_cache_shrink instead.  This will simplify things a lot w/o
      introducing any functional changes.
      
      And regarding dead memcg caches (i.e.  those that are left hanging around
      after memcg offline for they have objects), I suppose we should reap them
      either periodically or on vmpressure as Glauber suggested initially.  I'm
      going to implement this later.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e32e77f
    • memcg: kill CONFIG_MM_OWNER · f98bafa0
      Authored by Oleg Nesterov
      CONFIG_MM_OWNER makes no sense.  It is not user-selectable, it is only
      selected by CONFIG_MEMCG automatically.  So we can kill this option in
      init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f98bafa0
    • mm/swap.c: clean up *lru_cache_add* functions · 2329d375
      Authored by Jianyu Zhan
      In mm/swap.c, __lru_cache_add() is exported, but actually there are no
      users outside this file.
      
      This patch unexports __lru_cache_add() and makes it static.  It also
      exports lru_cache_add_file(), as it is used by cifs and fuse, which can
      be loaded as modules.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2329d375
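      A sketch of the resulting arrangement in mm/swap.c (simplified and
      assumed from the description above):

      static void __lru_cache_add(struct page *page)   /* no longer exported */
      {
              struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

              page_cache_get(page);
              if (!pagevec_space(pvec))
                      __pagevec_lru_add(pvec);
              pagevec_add(pvec, page);
              put_cpu_var(lru_add_pvec);
      }

      void lru_cache_add_file(struct page *page)
      {
              __lru_cache_add(page);
      }
      EXPORT_SYMBOL(lru_cache_add_file);      /* used by the cifs and fuse modules */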
    • mem-hotplug: implement get/put_online_mems · bfc8c901
      Authored by Vladimir Davydov
      kmem_cache_{create,destroy,shrink} need to get a stable value of
      cpu/node online mask, because they init/destroy/access per-cpu/node
      kmem_cache parts, which can be allocated or destroyed on cpu/mem
      hotplug.  To protect against cpu hotplug, these functions use
      {get,put}_online_cpus.  However, they do nothing to synchronize with
      memory hotplug - taking the slab_mutex does not eliminate the
      possibility of race as described in patch 2.
      
      What we need there is something like get_online_cpus, but for memory.
      We already have lock_memory_hotplug, which serves for the purpose, but
      it's a bit of a hammer right now, because it's backed by a mutex.  As a
      result, it imposes some limitations to locking order, which are not
      desirable, and can't be used just like get_online_cpus.  That's why in
      patch 1 I substitute it with get/put_online_mems, which work exactly
      like get/put_online_cpus except they block not cpu, but memory hotplug.
      
      [ v1 can be found at https://lkml.org/lkml/2014/4/6/68.  I NAK'ed it
        myself, because it used an rw semaphore for get/put_online_mems,
        making them deadlock prone.  ]
      
      This patch (of 2):
      
      {un}lock_memory_hotplug, which is used to synchronize against memory
      hotplug, is currently backed by a mutex, which makes it a bit of a
      hammer - threads that only want to get a stable value of online nodes
      mask won't be able to proceed concurrently.  Also, it imposes some
      strong locking ordering rules on it, which narrows down the set of its
      usage scenarios.
      
      This patch introduces get/put_online_mems, which are the same as
      get/put_online_cpus, but for memory hotplug, i.e.  executing a code
      inside a get/put_online_mems section will guarantee a stable value of
      online nodes, present pages, etc.
      
      lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfc8c901
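      A hedged usage sketch (example_init_node() is a hypothetical helper):
      code that needs a stable view of the online node mask brackets itself
      with the new pair, just like get/put_online_cpus():

      static int example_init_all_nodes(void)
      {
              int nid;

              get_online_mems();              /* blocks memory hot-add/remove */
              for_each_online_node(nid)
                      example_init_node(nid);
              put_online_mems();

              return 0;
      }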
    • mm: page_alloc: do not cache reclaim distances · 5f7a75ac
      Authored by Mel Gorman
      pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed
      by zone_reclaim due to its distance.  As it is expected that
      zone_reclaim_mode will be rarely enabled it is unreasonable for all
      machines to take a penalty.  Fortunately, the zone_reclaim_mode() path
      is already slow and it is the path that takes the hit.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f7a75ac
    • mm: disable zone_reclaim_mode by default · 4f9b16a6
      Authored by Mel Gorman
      When it was introduced, zone_reclaim_mode made sense because remote
      NUMA accesses carried a heavy penalty and workloads were generally
      partitioned to fit into a NUMA
      node.  NUMA machines are now common but few of the workloads are
      NUMA-aware and it's routine to see major performance degradation due to
      zone_reclaim_mode being enabled but relatively few can identify the
      problem.
      
      Those that require zone_reclaim_mode are likely to be able to detect
      when it needs to be enabled and tune appropriately, so let's have a
      sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      local node instead of spilling over to other nodes.  This made sense
      initially when NUMA machines were almost exclusively HPC and the
      workload was partitioned into nodes.  The NUMA penalties were
      sufficiently high to justify reclaiming the memory.  On current machines
      and workloads it is often the case that zone_reclaim_mode destroys
      performance but not all users know how to detect this.  Favour the
      common case and disable it by default.  Users that are sophisticated
      enough to know they need zone_reclaim_mode will detect it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f9b16a6
    • hugetlb: add hstate_is_gigantic() · bae7f4ae
      Authored by Luiz Capitulino
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bae7f4ae
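      The commit adds a small predicate; a plausible sketch (assumed from the
      name, not confirmed by the log above) checks whether a hstate's order
      exceeds what the buddy allocator can provide:

      static inline bool hstate_is_gigantic(struct hstate *h)
      {
              /* "gigantic" pages are larger than the buddy allocator can serve */
              return huge_page_order(h) >= MAX_ORDER;
      }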
    • mm: pass VM_BUG_ON() reason to dump_page() · e4f67422
      Authored by Dave Hansen
      I recently added a patch that lets folks pass a "reason" string to
      dump_page(), which gets dumped out along with the page's data.  This
      essentially saves the bug-reader a trip into the source to figure out
      why we BUG_ON()'d.
      
      The new VM_BUG_ON_PAGE() passes in NULL for "reason".  It seems like we
      might as well pass the BUG_ON() condition if we have it.  This will
      bloat kernels a bit with ~160 new strings, but this is all under a
      debugging option anyway.
      
      	page:ffffea0008560280 count:1 mapcount:0 mapping:(null) index:0x0
      	page flags: 0xbfffc0000000001(locked)
      	page dumped because: VM_BUG_ON_PAGE(PageLocked(page))
      	------------[ cut here ]------------
      	kernel BUG at /home/davehans/linux.git/mm/filemap.c:464!
      	invalid opcode: 0000 [#1] SMP
      	CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.14.0+ #251
      	Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      	...
      
      [akpm@linux-foundation.org: include stringify.h]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4f67422
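      A sketch of the stringified-condition variant implied above (simplified;
      hence the stringify.h include mentioned in the akpm note):

      #include <linux/stringify.h>

      #define VM_BUG_ON_PAGE(cond, page)                                      \
              do {                                                            \
                      if (unlikely(cond)) {                                   \
                              dump_page(page,                                 \
                                        "VM_BUG_ON_PAGE(" __stringify(cond) ")"); \
                              BUG();                                          \
                      }                                                       \
              } while (0)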
    • include/linux/mmdebug.h: add VM_WARN_ON() and VM_WARN_ON_ONCE() · 02a8efed
      Authored by Andrew Morton
      Add VM_WARN_ON() and VM_WARN_ON_ONCE(), variants of WARN_ON() and
      WARN_ON_ONCE() that depend on CONFIG_DEBUG_VM.
      
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02a8efed
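      A minimal sketch of the new macros (assumed; the in-tree version may
      express the disabled case differently):

      #ifdef CONFIG_DEBUG_VM
      #define VM_WARN_ON(cond)        WARN_ON(cond)
      #define VM_WARN_ON_ONCE(cond)   WARN_ON_ONCE(cond)
      #else
      #define VM_WARN_ON(cond)        do { } while (0)
      #define VM_WARN_ON_ONCE(cond)   do { } while (0)
      #endif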
    • cma: add placement specifier for "cma=" kernel parameter · 5ea3b1b2
      Authored by Akinobu Mita
      Currently, "cma=" kernel parameter is used to specify the size of CMA,
      but we can't specify where it is located.  We want to locate CMA below
      4GB for devices only supporting 32-bit addressing on 64-bit systems
      without iommu.
      
      This enables specifying the placement of CMA by extending the "cma="
      kernel parameter.
      
      Examples:
       1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
       2. locate 64MB CMA exact at 512MB by "cma=64M@512M"
      
      Note that the DMA contiguous memory allocator on x86 assumes that
      page_address() works for the pages it allocates.  So this change limits
      the end address of the contiguous memory area to max_pfn_mapped, via
      the argument of dma_contiguous_reserve(), to prevent it from being
      placed in the highmem area.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ea3b1b2
    • memblock: introduce memblock_alloc_range() · 2bfc2862
      Authored by Akinobu Mita
      This introduces memblock_alloc_range(), which allocates memory from
      memblock within the specified physical address range.  I would like to
      use this function to specify the location of CMA.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bfc2862
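      A sketch of the likely signature and of how CMA placement could use it
      (the signature, SZ_4G, and the helper name are assumptions for
      illustration):

      phys_addr_t __init memblock_alloc_range(phys_addr_t size, phys_addr_t align,
                                              phys_addr_t start, phys_addr_t end);

      /* e.g. reserve 'size' bytes of physically contiguous memory below 4 GiB */
      static phys_addr_t __init example_reserve_cma_below_4g(phys_addr_t size)
      {
              return memblock_alloc_range(size, PAGE_SIZE, 0, SZ_4G);
      }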
    • x86: enable DMA CMA with swiotlb · 9c5a3621
      Authored by Akinobu Mita
      The DMA Contiguous Memory Allocator support on x86 is disabled when the
      swiotlb config option is enabled.  So DMA CMA is always disabled on
      x86_64, because swiotlb is always enabled there.  This attempts to
      support DMA CMA while the swiotlb config option is enabled.
      
      The contiguous memory allocator on x86 is integrated into
      dma_generic_alloc_coherent(), which is the .alloc callback in
      nommu_dma_ops for dma_alloc_coherent().

      x86_swiotlb_alloc_coherent(), which is the .alloc callback in
      swiotlb_dma_ops, tries to allocate with dma_generic_alloc_coherent()
      first and then falls back to swiotlb_alloc_coherent().
      
      The main part of supporting DMA CMA with swiotlb is changing
      x86_swiotlb_free_coherent(), the .free callback in swiotlb_dma_ops for
      dma_free_coherent(), so that it can distinguish memory allocated by
      dma_generic_alloc_coherent() from memory allocated by
      swiotlb_alloc_coherent() and release the former with
      dma_generic_free_coherent(), which can handle contiguous memory.  This
      change requires making is_swiotlb_buffer() a global function.
      
      This also requires changing the .free callback in the dma_map_ops for
      amd_gart and sta2x11, because these dma_ops also use
      dma_generic_alloc_coherent().
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c5a3621
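      A hedged sketch of the .free callback change described above (assumed
      shape, simplified):

      static void x86_swiotlb_free_coherent(struct device *dev, size_t size,
                                            void *vaddr, dma_addr_t dma_addr,
                                            struct dma_attrs *attrs)
      {
              /* swiotlb bounce buffer, or CMA/page-allocator memory? */
              if (is_swiotlb_buffer(dma_to_phys(dev, dma_addr)))
                      swiotlb_free_coherent(dev, size, vaddr, dma_addr);
              else
                      dma_generic_free_coherent(dev, size, vaddr, dma_addr, attrs);
      }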
    • mm,vmacache: add debug data · 4f115147
      Authored by Davidlohr Bueso
      Introduce a CONFIG_DEBUG_VM_VMACACHE option to enable counting the cache
      hit rate -- exported in /proc/vmstat.
      
      Any update to the caching scheme needs this kind of data, so having it
      in-tree saves re-implementing the counting every time.
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f115147
    • mm: get rid of __GFP_KMEMCG · 52383431
      Authored by Vladimir Davydov
      Currently to allocate a page that should be charged to kmemcg (e.g.
      threadinfo), we pass __GFP_KMEMCG flag to the page allocator.  The page
      allocated is then to be freed by free_memcg_kmem_pages.  Apart from
      looking asymmetrical, this also requires intrusion to the general
      allocation path.  So let's introduce separate functions that will
      alloc/free pages charged to kmemcg.
      
      The new functions are called alloc_kmem_pages and free_kmem_pages.  They
      should be used when the caller actually would like to use kmalloc, but
      has to fall back to the page allocator because the allocation is large.
      They only differ from alloc_pages and free_pages in that besides
      allocating or freeing pages they also charge them to the kmem resource
      counter of the current memory cgroup.
      
      [sfr@canb.auug.org.au: export kmalloc_order() to modules]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52383431
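      A hedged usage sketch (the exact signatures are assumptions based on
      the description above; the thread_info example is illustrative only):

      static struct thread_info *example_alloc_thread_info(void)
      {
              /*
               * Like alloc_pages(), but also charged to the current
               * memcg's kmem counter.
               */
              struct page *page = alloc_kmem_pages(GFP_KERNEL, THREAD_SIZE_ORDER);

              return page ? page_address(page) : NULL;
      }

      static void example_free_thread_info(struct thread_info *ti)
      {
              /* assumed counterpart taking a virtual address */
              free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
      }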
    • sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Authored by Vladimir Davydov
      We have only a few places where we actually want to charge kmem so
      instead of intruding into the general page allocation path with
      __GFP_KMEMCG it's better to explicitly charge kmem there.  All kmem
      charges will be easier to follow that way.
      
      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
      from memcg caches' allocflags.  Instead it makes slab allocation path
      call memcg_charge_kmem directly getting memcg to charge from the cache's
      memcg params.
      
      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That's why this
      patch removes the big comment to memcg_kmem_get_cache.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5dfb4175
    • x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Authored by Mel Gorman
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific, and conceptually
      it is awkward, requiring special-casing to distinguish between PROTNONE
      and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
    • fs/libfs.c: add generic data flush to fsync · ac13a829
      Authored by Fabian Frederick
      Description by Jan Kara:
       "A lot of older filesystems don't properly flush volatile disk caches
        on fsync(2) which can lead to loss of fsynced data after power failure.
      
      This patch makes generic_file_fsync() issue proper cache flush to fix the
      problem.  Sysadmin can use /sys/devices/.../cache_type to tell the system
      it should not send the cache flush."
      
      [akpm@linux-foundation.org: nuke ifdef]
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac13a829
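      A simplified sketch of the idea (assumed shape, error handling trimmed):
      after writing out the data and the inode, also flush the device's
      volatile write cache:

      int generic_file_fsync(struct file *file, loff_t start, loff_t end,
                             int datasync)
      {
              struct inode *inode = file->f_mapping->host;
              int err;

              err = __generic_file_fsync(file, start, end, datasync);
              if (err)
                      return err;

              /* ask the block device to flush its volatile write cache */
              return blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
      }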
    • hugetlb: restrict hugepage_migration_support() to x86_64 · c177c81e
      Authored by Naoya Horiguchi
      Currently hugepage migration is available for all archs which support
      pmd-level hugepages, but testing has been done only for x86_64 and there are
      bugs for other archs.  So to avoid breaking such archs, this patch
      limits the availability strictly to x86_64 until developers of other
      archs get interested in enabling this feature.
      
      Simply disabling hugepage migration on non-x86_64 archs is not enough to
      fix the reported problem where sys_move_pages() hits the BUG_ON() in
      follow_page(FOLL_GET), so let's fix this by checking if hugepage
      migration is supported in vma_migratable().
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c177c81e
  2. 04 Jun 2014 (1 commit)
    • of/irq: provide more wrappers for !CONFIG_OF · 64c5c759
      Authored by Arnd Bergmann
      The pci-rcar driver is enabled for compile tests, and this has
      now shown that the driver cannot build without CONFIG_OF,
      following the inclusion of f8f2fe73 "PCI: rcar: Use new OF
      interrupt mapping when possible":
      
      drivers/built-in.o: In function `rcar_pci_map_irq':
      :(.text+0x1cc7c): undefined reference to `of_irq_parse_and_map_pci'
      pci/host/pcie-rcar.c: In function 'pci_dma_range_parser_init':
      pci/host/pcie-rcar.c:875:2: error: implicit declaration of function 'of_n_addr_cells' [-Werror=implicit-function-declaration]
      
      As pointed out by Ben Dooks and Geert Uytterhoeven, this is actually
      supposed to build fine, which we can achieve if we make the
      declaration of of_irq_parse_and_map_pci conditional on CONFIG_OF
      and provide an empty inline function otherwise, as we do for
      many other OF interfaces.
      
      This lets us build the rcar_pci driver again without CONFIG_OF
      for build testing. All platforms using this driver select OF,
      so this doesn't change anything for the users.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: devicetree@vger.kernel.org
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Magnus Damm <damm@opensource.se>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ben Dooks <ben.dooks@codethink.co.uk>
      Cc: linux-pci@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      [robh: drop wrappers for of_n_addr_cells and of_n_size_cells which are
      low-level functions that should not be used for !OF]
      Signed-off-by: Rob Herring <robh@kernel.org>
      64c5c759
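      A sketch of the stub pattern described above (simplified):

      #ifdef CONFIG_OF
      int of_irq_parse_and_map_pci(const struct pci_dev *dev, u8 slot, u8 pin);
      #else
      static inline int of_irq_parse_and_map_pci(const struct pci_dev *dev,
                                                 u8 slot, u8 pin)
      {
              return 0;
      }
      #endif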
  3. 03 Jun 2014 (2 commits)
  4. 02 Jun 2014 (5 commits)