1. 08 Feb 2014, 1 commit
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Committed by Tejun Heo
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems; it doesn't and shouldn't
        claim such generic names, which carry no qualifier indicating
        that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the following.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is pointless - the kernel
        can't print anything, even through the serial console, and the trap
        handler doesn't even link the stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
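
      As an illustration, a minimal sketch of the resulting convention
      (illustrative excerpt only, not the full kernel definitions; the real
      identifiers are generated from cgroup_subsys.h via the SUBSYS()
      macro):

        /* enum identifiers now carry the _cgrp_id postfix and match the
         * userland-visible controller names: */
        enum cgroup_subsys_id {
                cpu_cgrp_id,            /* was cpu_cgroup_subsys_id */
                memory_cgrp_id,         /* was mem_cgroup_subsys_id */
                perf_event_cgrp_id,     /* was perf_subsys_id */
                CGROUP_SUBSYS_COUNT,
        };

        /* controllers now define foo_cgrp_subsys and no longer set
         * ->subsys_id or ->name; cgroup core fills those in at boot: */
        struct cgroup_subsys memory_cgrp_subsys = {
                .css_alloc = mem_cgroup_css_alloc,
                .css_free = mem_cgroup_css_free,
                /* callbacks only */
        };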
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
  2. 31 Jan 2014, 1 commit
  3. 24 Jan 2014, 17 commits
    • memcg: remove unused code from kmem_cache_destroy_work_func · 0d8a4a37
      Committed by Vladimir Davydov
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix css reference leak and endless loop in mem_cgroup_iter · 0eef6156
      Committed by Michal Hocko
      Commit 19f39402 ("memcg: simplify mem_cgroup_iter") has reorganized
      mem_cgroup_iter code in order to simplify it.  A part of that change was
      dropping an optimization which didn't call css_tryget on the root of the
      walked tree.  The patch however didn't change the css_put part in
      mem_cgroup_iter which excludes root.
      
      This wasn't an issue at the time because __mem_cgroup_iter_next bailed
      out for root early without taking a reference as cgroup iterators
      (css_next_descendant_pre) didn't visit root themselves.
      
      Nevertheless, cgroup iterators were reworked to visit the root by
      commit bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration"), at which point the root
      bypass was dropped from __mem_cgroup_iter_next.  This means that
      css_put is not called for the root, so the css, along with the
      mem_cgroup and other cgroup-internal objects tied to the css
      lifetime, is never freed.
      
      Fix the issue by reintroducing the root check in
      __mem_cgroup_iter_next and not taking a css reference for it.
      
      This reference-counting magic also protects us from another issue: an
      endless loop reported by Hugh Dickins when reclaim races with root
      removal and the css_tryget called internally by the iterator fails.
      There would be no other nodes to visit, so __mem_cgroup_iter_next
      would return NULL, which mem_cgroup_iter would interpret as "start
      looping from the root again", and so mem_cgroup_iter would loop
      forever internally.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix endless loop caused by mem_cgroup_iter · ecc736fc
      Committed by Michal Hocko
      Hugh has reported an endless loop when the hardlimit reclaim sees the
      same group all the time.  This might happen when the reclaim races with
      the memcg removal.
      
      shrink_zone
                                                      [rmdir root]
        mem_cgroup_iter(root, NULL, reclaim)
          // prev = NULL
          rcu_read_lock()
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root || NULL
            css_tryget(last_visited)            // failed
            last_visited = NULL                 [1]
          memcg = root = __mem_cgroup_iter_next(root, NULL)
          mem_cgroup_iter_update
            iter->last_visited = root;
          reclaim->generation = iter->generation
      
       mem_cgroup_iter(root, root, reclaim)
         // prev = root
         rcu_read_lock
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root
            css_tryget(last_visited)            // failed
          [1]
      
      The issue appears to have been introduced by commit 5f578161 ("memcg:
      relax memcg iter caching"), which replaced the unconditional
      css_get/css_put with css_tryget/css_put for the cached iterator.
      
      This patch fixes the issue by skipping css_tryget on the root of the
      tree walk in mem_cgroup_iter_load and, symmetrically, by not
      releasing it in mem_cgroup_iter_update.
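
      A hedged sketch of the change in mem_cgroup_iter_load() (not the
      verbatim diff):

        last_visited = iter->last_visited;
        /* never tryget the root of the walk: the caller keeps it
         * alive, and css_tryget may fail during a racing rmdir */
        if (last_visited && last_visited != root &&
            !css_tryget(&last_visited->css))
                last_visited = NULL;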
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: prefer thread group leaders for display purposes · d49ad935
      Committed by David Rientjes
      When two threads have the same badness score, it's preferable to kill
      the thread group leader so that the actual process name is printed to
      the kernel log rather than the thread group name which may be shared
      amongst several processes.
      
      This was the behavior when select_bad_process() used to do
      for_each_process(), but it now iterates over threads instead, which
      leads to ambiguity.
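
      A hedged sketch of the tie-break in select_bad_process() (variable
      names assumed from that function, not quoted from the patch):

        /* on a badness tie, keep the current victim if it is already
         * a thread group leader so the process name gets logged */
        if (points == chosen_points && thread_group_leader(chosen))
                continue;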
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: iteration skip memcgs not yet fully initialized · d8ad3055
      Committed by Hugh Dickins
      It is surprising that the mem_cgroup iterator can return memcgs which
      have not yet been fully initialized.  By accident (or trial and error?)
      this appears not to present an actual problem; but it may be better to
      prevent such surprises, by skipping memcgs not yet online.
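
      A hedged sketch of the skip, keyed off the CSS_ONLINE flag (not the
      verbatim diff):

        /* only hand out memcgs that are already online */
        if (next_css == &root->css ||
            ((next_css->flags & CSS_ONLINE) && css_tryget(next_css)))
                return mem_cgroup_from_css(next_css);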
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcg: fix last_dead_count memory wastage · d2ab70aa
      Committed by Hugh Dickins
      Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
      int: it's assigned from an int and compared with an int, and is
      adjacent to an unsigned int, so there's no point in it being unsigned
      long - that wasted 104 bytes in every mem_cgroup_per_zone.
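
      The resulting layout, roughly (each mem_cgroup_per_zone holds
      DEF_PRIORITY + 1 == 13 such iterators, so packing the int against the
      adjacent unsigned int saves 8 bytes apiece):

        struct mem_cgroup_reclaim_iter {
                struct mem_cgroup *last_visited;
                int last_dead_count;            /* was unsigned long */
                unsigned int generation;        /* now packs with above */
        };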
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: rework memcg_update_kmem_limit synchronization · d6441637
      Committed by Vladimir Davydov
      Currently we take both the memcg_create_mutex and the set_limit_mutex
      when we enable kmem accounting for a memory cgroup, which makes kmem
      activation events serialize with both memcg creations and other memcg
      limit updates (memory.limit, memory.memsw.limit).  However, there is no
      point in such strict synchronization rules there.
      
      First, the set_limit_mutex was introduced to keep the memory.limit and
      memory.memsw.limit values in sync.  Since memory.kmem.limit can be set
      independently of them, it is better to introduce a separate mutex to
      synchronize against concurrent kmem limit updates.
      
      Second, we take the memcg_create_mutex in order to make sure all
      children of this memcg will be kmem-active as well.  However, for
      that it is enough to hold this mutex only while checking
      memcg_has_children().  This guarantees that if a child is added
      after we checked that the memcg has no children, the newly added
      cgroup will see its parent kmem-active (provided the activation
      succeeded) and call kmem activation for itself.
      
      This patch simplifies the locking rules of memcg_update_kmem_limit()
      according to these considerations.
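
      A hedged sketch of the resulting scheme; the dedicated mutex and the
      reduced memcg_create_mutex critical section follow the description
      above and are illustrative, not the verbatim patch:

        static DEFINE_MUTEX(activate_kmem_mutex);

        mutex_lock(&activate_kmem_mutex);       /* kmem updates only */

        /* memcg_create_mutex now guards just the children check */
        mutex_lock(&memcg_create_mutex);
        busy = memcg_has_children(memcg);
        mutex_unlock(&memcg_create_mutex);

        /* ... activate kmem accounting, set the limit ... */
        mutex_unlock(&activate_kmem_mutex);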
      
      [vdavydov@parallels.com: fix uninitialized var warning]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove KMEM_ACCOUNTED_ACTIVATED flag · 6de64beb
      Committed by Vladimir Davydov
      Currently we have two state bits in mem_cgroup::kmem_account_flags
      regarding kmem accounting activation, ACTIVATED and ACTIVE.  We start
      kmem accounting only if both flags are set (memcg_can_account_kmem()),
      plus throughout the code there are several places where we check only
      the ACTIVE flag, but we never check the ACTIVATED flag alone.  These
      flags are both set from memcg_update_kmem_limit() under the
      set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
      they never get cleared.  That said, checking whether both flags are
      set is equivalent to checking only the ACTIVE flag, and since there
      are no ACTIVATED-only checks, we can safely remove the ACTIVATED
      flag, and nothing will change.
      
      Let's try to understand why these flags were introduced in the first
      place.  The purpose of the ACTIVE flag is clear - it states that kmem
      is being accounted for the cgroup.  The only requirement is that it
      be set after we have fully initialized kmem accounting bits for the
      cgroup and patched all static branches relating to kmem accounting.
      Since we always check whether the static branch is enabled before
      actually considering if we should account (otherwise we wouldn't
      benefit from static branching), this guarantees that we won't skip a
      commit or uncharge after a charge due to an unpatched static branch.
      
      Now let's move on to the ACTIVATED bit.  As I showed at the beginning
      of this message, it is absolutely useless, and removing it will
      change nothing.  So what was the reason for introducing it?
      
      The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use
      static branches when code not in use") in order to guarantee that
      static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
      for each memory cgroup when its kmem accounting was activated.  The
      point was that at that time the memcg_update_kmem_limit() function's
      work-flow looked like this:
      
              bool must_inc_static_branch = false;
      
              cgroup_lock();
              mutex_lock(&set_limit_mutex);
              if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                      /* The kmem limit is set for the first time */
                      ret = res_counter_set_limit(&memcg->kmem, val);
      
                      memcg_kmem_set_activated(memcg);
                      must_inc_static_branch = true;
              } else
                      ret = res_counter_set_limit(&memcg->kmem, val);
              mutex_unlock(&set_limit_mutex);
              cgroup_unlock();
      
              if (must_inc_static_branch) {
                      /* We can't do this under cgroup_lock */
                      static_key_slow_inc(&memcg_kmem_enabled_key);
                      memcg_kmem_set_active(memcg);
              }
      
      So without the ACTIVATED flag we could race with other threads trying
      to set the limit and increment the static-branching ref-counter more
      than once.  Today we call the whole memcg_update_kmem_limit()
      function under the set_limit_mutex and this race is impossible.
      
      Now that we understand why the ACTIVATED bit was introduced and why
      we no longer need it, and know that removing it will change nothing
      anyway, let's get rid of it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: RCU protect memcg_params for root caches · f8570263
      Committed by Vladimir Davydov
      We relocate a root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
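
      A hedged sketch of the resulting lock-free lookup (helper names as
      used above; the body is illustrative):

        static inline struct kmem_cache *
        cache_from_memcg_idx(struct kmem_cache *s, int idx)
        {
                struct memcg_cache_params *params;
                struct kmem_cache *cachep = NULL;

                rcu_read_lock();
                params = rcu_dereference(s->memcg_params);
                if (params)
                        cachep = params->memcg_caches[idx];
                rcu_read_unlock();
                /* the old array is now freed via RCU, not immediately */
                return cachep;
        }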
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: get rid of kmem_cache_dup() · 842e2873
      Committed by Vladimir Davydov
      kmem_cache_dup() is only called from memcg_create_kmem_cache().  The
      latter, in fact, does nothing besides this, so let's fold
      kmem_cache_dup() into memcg_create_kmem_cache().
      
      This patch also makes the memcg_cache_mutex private to
      memcg_create_kmem_cache(), because it is not used anywhere else.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: fix races in per-memcg cache creation/destruction · 2edefe11
      Committed by Vladimir Davydov
      We obtain a per-memcg cache from a root kmem_cache by dereferencing an
      entry of the root cache's memcg_params::memcg_caches array.  If we find
      no cache for a memcg there on allocation, we initiate the memcg cache
      creation (see memcg_kmem_get_cache()).  The cache creation proceeds
      asynchronously in memcg_create_kmem_cache() in order to avoid lock
      clashes, so there can be several threads trying to create the same
      kmem_cache concurrently, but only one of them may succeed.  However,
      due to a race in the code, this is not always true.  The point is
      that the
      memcg_caches array can be relocated when we activate kmem accounting for
      a memcg (see memcg_update_all_caches(), memcg_update_cache_size()).  If
      memcg_update_cache_size() and memcg_create_kmem_cache() proceed
      concurrently as described below, we can leak a kmem_cache.
      
      Assume two threads schedule creation of the same kmem_cache.  One of them
      successfully creates it.  Another one should fail then, but if
      memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
      follows, it won't:
      
        memcg_create_kmem_cache()             memcg_update_cache_size()
        (called w/o mutexes held)             (called with slab_mutex,
                                               set_limit_mutex held)
        -------------------------             -------------------------
      
        mutex_lock(&memcg_cache_mutex)
      
                                              s->memcg_params=kzalloc(...)
      
        new_cachep=cache_from_memcg_idx(cachep,idx)
        // new_cachep==NULL => proceed to creation
      
                                              s->memcg_params->memcg_caches[i]
                                                  =cur_params->memcg_caches[i]
      
        // kmem_cache_create_memcg takes slab_mutex
        // so we will hang around until
        // memcg_update_cache_size finishes, but
        // nothing will prevent it from succeeding so
        // memcg_caches[idx] will be overwritten in
        // memcg_register_cache!
      
        new_cachep = kmem_cache_create_memcg(...)
        mutex_unlock(&memcg_cache_mutex)
      
      Let's fix this by moving the existence check for the memcg cache into
      kmem_cache_create_memcg(), where it runs under the slab_mutex, and
      making it return NULL if the cache already exists.
      
      A similar race is possible when destroying a memcg cache (see
      kmem_cache_destroy()).  Since memcg_unregister_cache(), which clears the
      pointer in the memcg_caches array, is called w/o protection, we can race
      with memcg_update_cache_size() and omit clearing the pointer.  Therefore
      memcg_unregister_cache() should be moved before we release the
      slab_mutex.
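
      A hedged sketch of the creation-side fix, with the existence check
      now performed inside kmem_cache_create_memcg() under the slab_mutex
      (illustrative, not the verbatim diff):

        mutex_lock(&slab_mutex);
        /* the memcg_caches array cannot be relocated while we hold
         * slab_mutex, so this check cannot go stale */
        if (memcg && cache_from_memcg_idx(parent_cache,
                                          memcg_cache_id(memcg))) {
                mutex_unlock(&slab_mutex);
                return NULL;    /* lost the race: cache already exists */
        }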
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix possible NULL deref while traversing memcg_slab_caches list · 96403da2
      Committed by Vladimir Davydov
      All caches of the same memory cgroup are linked in the memcg_slab_caches
      list via kmem_cache::memcg_params::list.  This list is traversed, for
      example, when we read memory.kmem.slabinfo.
      
      Since the list actually consists of memcg_cache_params objects, we have
      to convert an element of the list to a kmem_cache object using
      memcg_params_to_cache(), which obtains the pointer to the cache from the
      memcg_params::memcg_caches array of the corresponding root cache.
      That said, the pointer to a kmem_cache in its parent's memcg_params
      must be initialized before the cache is added to the list, and
      cleared only after it has been unlinked.  Currently the order is
      reversed, which can result in a NULL pointer dereference while
      traversing the memcg_slab_caches list.  This patch restores the
      correct order.
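
      A hedged sketch of the restored ordering (locking and barriers
      omitted; see also the barrier fix in commit 959c8963 below):

        /* register: make the pointer valid before the cache is
         * reachable via the memcg_slab_caches list */
        root->memcg_params->memcg_caches[id] = s;
        list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);

        /* unregister: unlink first, clear the pointer last */
        list_del(&s->memcg_params->list);
        root->memcg_params->memcg_caches[id] = NULL;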
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: fix barrier usage when accessing memcg_caches · 959c8963
      Committed by Vladimir Davydov
      Each root kmem_cache has pointers to per-memcg caches stored in its
      memcg_params::memcg_caches array.  Whenever we want to allocate a slab
      for a memcg, we access this array to get per-memcg cache to allocate
      from (see memcg_kmem_get_cache()).  The access must be lock-free for
      performance reasons, so we should use barriers to ensure the
      kmem_cache we see is up-to-date.
      
      First, we should place a write barrier immediately before setting the
      pointer to it in the memcg_caches array in order to make sure nobody
      will see a partially initialized object.  Second, we should issue a
      read barrier before dereferencing the pointer, to pair with the
      write barrier.
      
      However, currently the barrier usage looks rather strange.  We have a
      write barrier *after* setting the pointer and a read barrier *before*
      reading the pointer, which is incorrect.  This patch fixes this.
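
      A hedged sketch of the corrected pairing (locking omitted; placement
      follows the description above):

        /* writer: finish initializing the cache ... */
        smp_wmb();      /* ... before publishing it in the array */
        root->memcg_params->memcg_caches[idx] = cachep;

        /* reader: load the pointer, then order the dereference */
        cachep = params->memcg_caches[idx];
        smp_read_barrier_depends();     /* pairs with smp_wmb() above */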
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: clean up memcg cache initialization/destruction · 1aa13254
      Committed by Vladimir Davydov
      Currently, we have a rather messy set of functions relating to
      per-memcg kmem cache initialization/destruction.
      
      Per-memcg caches are created in memcg_create_kmem_cache().  This
      function calls kmem_cache_create_memcg() to allocate and initialize a
      kmem cache and then "registers" the new cache in the
      memcg_params::memcg_caches array of the parent cache.
      
      During its work-flow, kmem_cache_create_memcg() executes the following
      memcg-related functions:
      
       - memcg_alloc_cache_params(), to initialize memcg_params of the newly
         created cache;
       - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
         list.
      
      On the other hand, kmem_cache_destroy() called on a cache destruction
      only calls memcg_release_cache(), which does all the work: it cleans the
      reference to the cache in its parent's memcg_params::memcg_caches,
      removes the cache from the memcg_slab_caches list, and frees
      memcg_params.
      
      Such an inconsistency between the destruction and initialization
      paths makes the code difficult to read, so let's clean this up a bit.
      
      This patch moves all the code relating to registration of per-memcg
      caches (adding to memcg list, setting the pointer to a cache from its
      parent) to the newly created memcg_register_cache() and
      memcg_unregister_cache() functions making the initialization and
      destruction paths look symmetrical.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path · 363a044f
      Committed by Vladimir Davydov
      We do not free the cache's memcg_params if __kmem_cache_create fails.
      Fix this.
      
      Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
      because it does not actually register the cache anywhere; it simply
      initializes kmem_cache::memcg_params.
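
      A hedged sketch of the fixed error path (helper names per the rename
      above; label name assumed):

        err = memcg_alloc_cache_params(memcg, s, parent_cache);
        if (err)
                goto out_free_cache;

        err = __kmem_cache_create(s, flags);
        if (err)
                goto out_free_cache;    /* memcg_params used to leak here */
        ...

        out_free_cache:
                memcg_free_cache_params(s);
                kmem_cache_free(kmem_cache, s);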
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Committed by Sasha Levin
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of page-dumping code to
      various VM_BUG_ON sites, I've noticed that the page dump is quite
      useful to people debugging issues in mm.
      
      This patch adds VM_BUG_ON_PAGE(cond, page), which, beyond doing what
      VM_BUG_ON() does, also dumps the page before executing the actual
      BUG_ON.
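
      A hedged sketch of the macro (the dump_page() of that era took just
      the page; the !CONFIG_DEBUG_VM branch mirrors VM_BUG_ON()):

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON_PAGE(cond, page)                              \
                do {                                                    \
                        if (unlikely(cond)) {                           \
                                dump_page(page);  /* dump before BUG */ \
                                BUG();                                  \
                        }                                               \
                } while (0)
        #else
        #define VM_BUG_ON_PAGE(cond, page) BUILD_BUG_ON_INVALID(cond)
        #endif

      Call sites then convert mechanically, e.g. VM_BUG_ON(PageSlab(page))
      becomes VM_BUG_ON_PAGE(PageSlab(page), page).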
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: do not use vmalloc for mem_cgroup allocations · 8ff69e2c
      Committed by Vladimir Davydov
      The vmalloc was introduced by 33327948 ("memcgroup: use vmalloc for
      mem_cgroup allocation") because, at that time, MAX_NUMNODES was used
      to size the per-node array in the mem_cgroup structure, so the
      structure could be huge even if the system had only one NUMA node.
      
      The situation was significantly improved by commit 45cf7ebd ("memcg:
      reduce the size of struct memcg 244-fold"), which made the size of
      the mem_cgroup structure depend on the real number of NUMA nodes
      installed on the system (nr_node_ids), so now there is no point in
      using vmalloc here: the structure is allocated rarely and on most
      systems its size is about 1K.
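
      A hedged sketch of the allocation after this change (the size
      computation follows the 244-fold commit referenced above):

        /* scales with the real node count, so plain kzalloc suffices */
        size = sizeof(struct mem_cgroup);
        size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
        memcg = kzalloc(size, GFP_KERNEL);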
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 Jan 2014, 3 commits
  5. 03 Jan 2014, 1 commit
  6. 13 Dec 2013, 3 commits
    • mm: memcg: do not allow task about to OOM kill to bypass the limit · 1f14c1ac
      Committed by Johannes Weiner
      Commit 49426420 ("mm: memcg: handle non-error OOM situations more
      gracefully") allowed tasks that already entered a memcg OOM condition to
      bypass the memcg limit on subsequent allocation attempts, hoping this
      would expedite finishing the page fault and executing the kill.
      
      David Rientjes is worried that this breaks memcg isolation
      guarantees, and since there is no evidence that the bypass actually
      speeds up fault processing, just change it so that these subsequent
      charge attempts fail outright.  The notable exception is __GFP_NOFAIL
      charges, which are required to bypass the limit regardless.
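
      A hedged sketch of the change in the charge path (label names assumed
      from the surrounding code):

        /* a task already under memcg OOM now fails outright instead
         * of bypassing the limit; __GFP_NOFAIL still bypasses */
        if (unlikely(task_in_memcg_oom(current)))
                goto nomem;     /* was: goto bypass */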
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: fix race condition between memcg teardown and swapin · 96f1c58d
      Committed by Johannes Weiner
      There is a race condition between a memcg being torn down and a swapin
      triggered from a different memcg of a page that was recorded to belong
      to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension).  The
      result is unreclaimable pages pointing to dead memcgs, which can lead
      to anything from endless loops during later memcg teardown (the page
      is charged to all hierarchical parents but is not on any LRU list) to
      crashes from following the dangling memcg pointer.
      
      Memcgs with tasks in them cannot be torn down, and usually charges
      don't show up in memcgs without tasks.  Swapin with the CONFIG_MEMCG_SWAP
      extension is the notable exception because it charges the cgroup that
      was recorded as owner during swapout, which may be empty and in the
      process of being torn down when a task in another memcg triggers the
      swapin:
      
        teardown:                 swapin:
      
                                  lookup_swap_cgroup_id()
                                  rcu_read_lock()
                                  mem_cgroup_lookup()
                                  css_tryget()
                                  rcu_read_unlock()
        disable css_tryget()
        call_rcu()
          offline_css()
            reparent_charges()
                                  res_counter_charge() (hierarchical!)
                                  css_put()
                                    css_free()
                                  pc->mem_cgroup = dead memcg
                                  add page to dead lru
      
      Add a final reparenting step into css_free() to make sure any such raced
      charges are moved out of the memcg before it's finally freed.
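
      A hedged sketch of the added step (surrounding calls as in
      mm/memcontrol.c of that period):

        static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
        {
                struct mem_cgroup *memcg = mem_cgroup_from_css(css);

                /* catch charges that raced in after offline_css()
                 * already reparented (see the diagram above) */
                mem_cgroup_reparent_charges(memcg);

                memcg_destroy_kmem(memcg);
                __mem_cgroup_free(memcg);
        }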
      
      In the longer term it would be cleaner to have the css_tryget() and the
      res_counter charge under the same RCU lock section so that the charge
      reparenting is deferred until the last charge whose tryget succeeded is
      visible.  But this will require more invasive changes that will be
      harder to evaluate and backport into stable, so better defer them to a
      separate change set.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: do not declare OOM from __GFP_NOFAIL allocations · a0d8b00a
      Committed by Johannes Weiner
      Commit 84235de3 ("fs: buffer: move allocation failure loop into the
      allocator") started recognizing __GFP_NOFAIL in memory cgroups but
      forgot to disable the OOM killer.
      
      Any task whose allocation does not fail will also never enter the OOM
      completion path.  So don't declare an OOM state in this case, or it
      will be leaked and the task will be able to bypass the limit until
      the next userspace-triggered page fault cleans up the OOM state.
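
      A hedged sketch of the fix in the charge path:

        /* never set up an OOM state that no failing allocation
         * will come back to clean up */
        if (gfp_mask & __GFP_NOFAIL)
                oom = false;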
      Reported-by: William Dauchy <wdauchy@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 06 Dec 2013, 2 commits
    • cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82
      Committed by Tejun Heo
      In preparation for the conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      replaces cftype->read_seq_string() with cftype->seq_show(), which is
      not limited to single_open() operation and will map directly to the
      kernfs seq_file interface.
      
      The conversions are mechanical.  As ->seq_show() doesn't receive @css
      and @cft, functions that make use of them are converted to use
      seq_css() and seq_cft() respectively.  In several cases, e.g. when a
      function has seq_string in its name, the function name is updated to
      fit the new method better.
      
      This patch does not introduce any behavior changes.
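
      A hedged sketch of what the mechanical conversion looks like for one
      memcg file (handler body illustrative):

        /* before: int (*read_seq_string)(struct cgroup_subsys_state *,
         *                                struct cftype *,
         *                                struct seq_file *);
         * after:  int (*seq_show)(struct seq_file *, void *);  */
        static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
        {
                /* @css is recovered from the seq_file itself */
                struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf));

                seq_printf(sf, "oom_kill_disable %d\n",
                           memcg->oom_kill_disable);
                return 0;
        }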
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
    • memcg: convert away from cftype->read() and ->read_map() · 791badbd
      Committed by Tejun Heo
      In preparation for the conversion to kernfs, cgroup file handling is
      being consolidated so that it can be easily mapped to the
      seq_file-based interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string(), and all users of cftype->read() can be easily
      served, usually better, by seq_file and other methods.
      
      Update mem_cgroup_read() to return a u64 instead of printing the
      value itself, rename it to mem_cgroup_read_u64(), and update
      mem_cgroup_oom_control_read() to use ->read_seq_string() instead of
      ->read_map().
      
      This patch doesn't make any visible behavior changes.
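
      A hedged sketch of the new handler shape (body illustrative):

        /* returns the value; cgroup core formats and prints it */
        static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
                                       struct cftype *cft)
        {
                struct mem_cgroup *memcg = mem_cgroup_from_css(css);

                /* illustrative: the real handler decodes cft->private */
                return res_counter_read_u64(&memcg->res, RES_USAGE);
        }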
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
  8. 23 Nov 2013, 6 commits
  9. 15 Nov 2013, 1 commit
    • mm, thp: change pmd_trans_huge_lock() to return taken lock · bf929152
      Committed by Kirill A. Shutemov
      With split ptlock it's important to know which lock
      pmd_trans_huge_lock() took.  This patch adds one more parameter to the
      function to return the lock.
      
      In most places migration to the new API is trivial.  The exception is
      move_huge_pmd(): we need to take two locks if the pmd tables differ.
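
      A hedged sketch of the new calling convention:

        spinlock_t *ptl;

        if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
                /* pmd is huge and stable; ptl is whichever lock
                 * (split ptlock or page_table_lock) was taken */
                /* ... operate on the huge pmd ... */
                spin_unlock(ptl);
        }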
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 13 Nov 2013, 5 commits