1. 02 6月, 2015 2 次提交
    • T
      memcg: add mem_cgroup_root_css · 56161634
      Tejun Heo 提交于
      Add global mem_cgroup_root_css which points to the root memcg css.
      This will be used by cgroup writeback support.  If memcg is disabled,
      it's defined as ERR_PTR(-EINVAL).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      aCc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      56161634
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen 提交于
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: NSha Zhengju <handai.szj@gmail.com>
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c4843a75
  2. 16 4月, 2015 3 次提交
  3. 15 4月, 2015 3 次提交
  4. 13 3月, 2015 1 次提交
  5. 01 3月, 2015 2 次提交
    • J
      mm: memcontrol: use "max" instead of "infinity" in control knobs · d2973697
      Johannes Weiner 提交于
      The memcg control knobs indicate the highest possible value using the
      symbolic name "infinity", which is long and awkward to type.
      
      Switch to the string "max", which is just as descriptive but shorter and
      sweeter.
      
      This changes a user interface, so do it before the release and before
      the development flag is dropped from the default hierarchy.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2973697
    • M
      memcg: fix low limit calculation · 4e54dede
      Michal Hocko 提交于
      A memcg is considered low limited even when the current usage is equal to
      the low limit.  This leads to interesting side effects e.g.
      groups/hierarchies with no memory accounted are considered protected and
      so the reclaim will emit MEMCG_LOW event when encountering them.
      
      Another and much bigger issue was reported by Joonsoo Kim.  He has hit a
      NULL ptr dereference with the legacy cgroup API which even doesn't have
      low limit exposed.  The limit is 0 by default but the initial check fails
      for memcg with 0 consumption and parent_mem_cgroup() would return NULL if
      use_hierarchy is 0 and so page_counter_read would try to dereference NULL.
      
      I suppose that the current implementation is just an overlook because the
      documentation in Documentation/cgroups/unified-hierarchy.txt says:
      
        "The memory.low boundary on the other hand is a top-down allocated
        reserve.  A cgroup enjoys reclaim protection when it and all its
        ancestors are below their low boundaries"
      
      Fix the usage and the low limit comparision in mem_cgroup_low accordingly.
      
      Fixes: 241994ed (mm: memcontrol: default hierarchy interface for memory)
      Reported-by: NJoonsoo Kim <js1304@gmail.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e54dede
  6. 13 2月, 2015 8 次提交
    • V
      memcg: cleanup static keys decrement · f48b80a5
      Vladimir Davydov 提交于
      Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
      from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
      wrapper functions.
      
      Although this patch moves static keys decrement from __mem_cgroup_free to
      mem_cgroup_css_free, it does not introduce any functional changes, because
      the keys are incremented on setting the limit (tcp or kmem), which can
      only happen after successful mem_cgroup_css_online.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f48b80a5
    • V
      memcg: reparent list_lrus and free kmemcg_id on css offline · 2788cf0c
      Vladimir Davydov 提交于
      Now, the only reason to keep kmemcg_id till css free is list_lru, which
      uses it to distribute elements between per-memcg lists.  However, it can
      be easily sorted out - we only need to change kmemcg_id of an offline
      cgroup to its parent's id, making further list_lru_add()'s add elements to
      the parent's list, and then move all elements from the offline cgroup's
      list to the one of its parent.  It will work, because a racing
      list_lru_del() does not need to know the list it is deleting the element
      from.  It can decrement the wrong nr_items counter though, but the ongoing
      reparenting will fix it.  After list_lru reparenting is done we are free
      to release kmemcg_id saving a valuable slot in a per-memcg array for new
      cgroups.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2788cf0c
    • V
      memcg: free memcg_caches slot on css offline · 2a4db7eb
      Vladimir Davydov 提交于
      We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
      on allocations, so there is no need to have the array entries set until
      css free - we can clear them on css offline.  This will allow us to reuse
      array entries more efficiently and avoid costly array relocations.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a4db7eb
    • V
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov 提交于
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
    • V
      list_lru: introduce per-memcg lists · 60d3fd32
      Vladimir Davydov 提交于
      There are several FS shrinkers, including super_block::s_shrink, that
      keep reclaimable objects in the list_lru structure.  Hence to turn them
      to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
      
      This patch does the trick.  It adds an array of lru lists to the
      list_lru_node structure (per-node part of the list_lru), one for each
      kmem-active memcg, and dispatches every item addition or removal to the
      list corresponding to the memcg which the item is accounted to.  So now
      the list_lru structure is not just per node, but per node and per memcg.
      
      Not all list_lrus need this feature, so this patch also adds a new
      method, list_lru_init_memcg, which initializes a list_lru as memcg
      aware.  Otherwise (i.e.  if initialized with old list_lru_init), the
      list_lru won't have per memcg lists.
      
      Just like per memcg caches arrays, the arrays of per-memcg lists are
      indexed by memcg_cache_id, so we must grow them whenever
      memcg_nr_cache_ids is increased.  So we introduce a callback,
      memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
      space is full.
      
      The locking is implemented in a manner similar to lruvecs, i.e.  we have
      one lock per node that protects all lists (both global and per cgroup) on
      the node.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60d3fd32
    • V
      memcg: add rwsem to synchronize against memcg_caches arrays relocation · 05257a1a
      Vladimir Davydov 提交于
      We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
      (memcg_alloc_cache_params() wants it for root caches), where we only
      hold the slab_mutex and no memcg-related locks.  As a result, we have to
      update memcg_nr_cache_ids under the slab_mutex, which we can only take
      on the slab's side (see memcg_update_array_size).  This looks awkward
      and will become even worse when per-memcg list_lru is introduced, which
      also wants stable access to memcg_nr_cache_ids.
      
      To get rid of this dependency between the memcg_nr_cache_ids and the
      slab_mutex, this patch introduces a special rwsem.  The rwsem is held
      for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
      updates.  Therefore one can take it for reading to get a stable access
      to memcg_caches arrays and/or memcg_nr_cache_ids.
      
      Currently the semaphore is taken for reading only from
      kmem_cache_create, right before taking the slab_mutex, so right now
      there's no much point in using rwsem instead of mutex.  However, once
      list_lru is made per-memcg it will allow list_lru initializations to
      proceed concurrently.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05257a1a
    • V
      memcg: rename some cache id related variables · dbcf73e2
      Vladimir Davydov 提交于
      memcg_limited_groups_array_size, which defines the size of memcg_caches
      arrays, sounds rather cumbersome.  Also it doesn't point anyhow that
      it's related to kmem/caches stuff.  So let's rename it to
      memcg_nr_cache_ids.  It's concise and points us directly to
      memcg_cache_id.
      
      Also, rename kmem_limited_groups to memcg_cache_ida.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbcf73e2
    • V
      vmscan: per memory cgroup slab shrinkers · cb731d6c
      Vladimir Davydov 提交于
      This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
      set, it will be called per memory cgroup.  The memory cgroup to scan
      objects from is passed in shrink_control->memcg.  If the memory cgroup
      is NULL, a memcg aware shrinker is supposed to scan objects from the
      global list.  Unaware shrinkers are only called on global pressure with
      memcg=NULL.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb731d6c
  7. 12 2月, 2015 12 次提交
    • N
      memcg: cleanup preparation for page table walk · 26bcd64a
      Naoya Horiguchi 提交于
      pagewalk.c can handle vma in itself, so we don't have to pass vma via
      walk->private.  And both of mem_cgroup_count_precharge() and
      mem_cgroup_move_charge() do for each vma loop themselves, but now it's
      done in pagewalk.c, so let's clean up them.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26bcd64a
    • J
      mm: memcontrol: consolidate swap controller code · 21afa38e
      Johannes Weiner 提交于
      The swap controller code is scattered all over the file.  Gather all
      the code that isn't directly needed by the memory controller at the
      end of the file in its own CONFIG_MEMCG_SWAP section.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21afa38e
    • J
      mm: memcontrol: consolidate memory controller initialization · 95a045f6
      Johannes Weiner 提交于
      The initialization code for the per-cpu charge stock and the soft
      limit tree is compact enough to inline it into mem_cgroup_init().
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95a045f6
    • J
      mm: memcontrol: simplify soft limit tree init code · 9c608dbe
      Johannes Weiner 提交于
      - No need to test the node for N_MEMORY.  node_online() is enough for
        node fallback to work in slab, use NUMA_NO_NODE for everything else.
      
      - Remove the BUG_ON() for allocation failure.  A NULL pointer crash is
        just as descriptive, and the absent return value check is obvious.
      
      - Move local variables to the inner-most blocks.
      
      - Point to the tree structure after its initialized, not before, it's
        just more logical that way.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c608dbe
    • M
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko 提交于
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely and partial solution deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disable all the further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reacheable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • M
      oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Michal Hocko 提交于
      This patchset addresses a race which was described in the changelog for
      5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situtation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because that
      would require a more complex solution as it can be seen by this patchset.
      
      The primary motivation was to close the race condition between OOM killer
      and PM freezer _completely_.  As Tejun pointed out, even though the race
      condition is unlikely the harder it would be to debug weird bugs deep in
      the PM freezer when the debugging options are reduced considerably.  I can
      only speculate what might happen when a task is still runnable
      unexpectedly.
      
      On a plus side and as a side effect the oom enable/disable has a better
      (full barrier) semantic without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with any
        locking down wake_up_process. Oleg?
      
      As expected there are no OOM killed tasks after oom is disabled and
      allocations requested by the kernel thread are failing after all the tasks
      are frozen and OOM disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait but I kinda expected the race
      is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless the rest IMO.
      
      Patches 3 and 4 are trivial printk -> pr_info conversion and they should
      go in ditto.
      
      The main patch is the last one and I would appreciate acks from Tejun and
      Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
      task_lock where a look from Oleg would appreciated) but I am not so sure I
      haven't screwed anything in the freezer code.  I have found several
      surprises there.
      
      This patch (of 5):
      
      This patch is just a preparatory and it doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
      wait for the oom victim and to prevent from new killing. This is
      just a side effect of the flag. The primary meaning is to give the oom
      victim access to the memory reserves and that shouldn't be necessary
      here.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49550b60
    • J
      mm: memcontrol: fold move_anon() and move_file() · 1dfab5ab
      Johannes Weiner 提交于
      Turn the move type enum into flags and give the flags field a shorter
      name.  Once that is done, move_anon() and move_file() are simple enough to
      just fold them into the callsites.
      
      [akpm@linux-foundation.org: tweak MOVE_MASK definition, per Michal]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dfab5ab
    • J
      mm: memcontrol: default hierarchy interface for memory · 241994ed
      Johannes Weiner 提交于
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation
          and architecture dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not inconsistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      241994ed
    • J
      mm: page_counter: pull "-1" handling out of page_counter_memparse() · 650c5e56
      Johannes Weiner 提交于
      The unified hierarchy interface for memory cgroups will no longer use "-1"
      to mean maximum possible resource value.  In preparation for this, make
      the string an argument and let the caller supply it.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      650c5e56
    • G
      memcg: add BUILD_BUG_ON() for string tables · 0ca44b14
      Greg Thelen 提交于
      Use BUILD_BUG_ON() to compile assert that memcg string tables are in sync
      with corresponding enums.  There aren't currently any issues with these
      tables.  This is just defensive.
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ca44b14
    • V
      vmscan: force scan offline memory cgroups · 90cbc250
      Vladimir Davydov 提交于
      Since commit b2052564 ("mm: memcontrol: continue cache reclaim from
      offlined groups") pages charged to a memory cgroup are not reparented when
      the cgroup is removed.  Instead, they are supposed to be reclaimed in a
      regular way, along with pages accounted to online memory cgroups.
      
      However, an lruvec of an offline memory cgroup will sooner or later get so
      small that it will be scanned only at low scan priorities (see
      get_scan_count()).  Therefore, if there are enough reclaimable pages in
      big lruvecs, pages accounted to offline memory cgroups will never be
      scanned at all, wasting memory.
      
      Fix this by unconditionally forcing scanning dead lruvecs from kswapd.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90cbc250
    • J
      mm: memcontrol: track move_lock state internally · 6de22619
      Johannes Weiner 提交于
      The complexity of memcg page stat synchronization is currently leaking
      into the callsites, forcing them to keep track of the move_lock state and
      the IRQ flags.  Simplify the API by tracking it in the memcg.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de22619
  8. 11 2月, 2015 4 次提交
  9. 06 2月, 2015 1 次提交
  10. 27 1月, 2015 1 次提交
  11. 09 1月, 2015 2 次提交
  12. 14 12月, 2014 1 次提交