1. 17 5月, 2014 2 次提交
    • M
      memcg: remove tasks/children test from mem_cgroup_force_empty() · f61c42a7
      Michal Hocko 提交于
      Tejun has correctly pointed out that tasks/children test in
      mem_cgroup_force_empty is not correct because there is no other locking
      which preserves this state throughout the rest of the function so both
      new tasks can join the group or new children groups can be added while
      somebody is writing to memory.force_empty. A new task would break
      mem_cgroup_reparent_charges expectation that all failures as described
      by mem_cgroup_force_empty_list are temporal and there is no way out.
      
      The main use case for the knob as described by
      Documentation/cgroups/memory.txt is to:
      "
        The typical use case for this interface is before calling rmdir().
        Because rmdir() moves all pages to parent, some out-of-use page caches can be
        moved to the parent. If you want to avoid that, force_empty will be useful.
      "
      
      This means that reparenting is not really required as rmdir will
      reparent pages implicitly from the safe context. If we remove it from
      mem_cgroup_force_empty then we are safe even with existing tasks because
      the number of reclaim attempts is bounded. Moreover the knob still does
      what the documentation claims (modulo reparenting which doesn't make any
      difference) and users might expect. Longterm we want to deprecate the
      whole knob and put the reparented pages to the tail of parent LRU during
      cgroup removal.
      
      tj: Removed unused variable @cgrp from mem_cgroup_force_empty()
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f61c42a7
    • T
      cgroup: remove css_parent() · 5c9d535b
      Tejun Heo 提交于
      cgroup in general is moving towards using cgroup_subsys_state as the
      fundamental structural component and css_parent() was introduced to
      convert from using cgroup->parent to css->parent.  It was quite some
      time ago and we're moving forward with making css more prominent.
      
      This patch drops the trivial wrapper css_parent() and let the users
      dereference css->parent.  While at it, explicitly mark fields of css
      which are public and immutable.
      
      v2: New usage from device_cgroup.c converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      5c9d535b
  2. 14 5月, 2014 3 次提交
    • T
      cgroup: replace cftype->trigger() with cftype->write() · 6770c64e
      Tejun Heo 提交于
      cftype->trigger() is pointless.  It's trivial to ignore the input
      buffer from a regular ->write() operation.  Convert all ->trigger()
      users to ->write() and remove ->trigger().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      6770c64e
    • T
      cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo 提交于
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
      
      cftype->write_string() has no user left after the conversions and
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NAristeu Rozanski <arozansk@redhat.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      451af504
    • T
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo 提交于
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
  3. 05 5月, 2014 2 次提交
    • T
      cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo 提交于
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      15a4c835
    • T
      cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo 提交于
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed form memcg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7d699ddb
  4. 08 4月, 2014 13 次提交
  5. 19 3月, 2014 1 次提交
    • T
      cgroup: drop const from @buffer of cftype->write_string() · 4d3bb511
      Tejun Heo 提交于
      cftype->write_string() just passes on the writeable buffer from kernfs
      and there's no reason to add const restriction on the buffer.  The
      only thing const achieves is unnecessarily complicating parsing of the
      buffer.  Drop const from @buffer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>                                           
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      4d3bb511
  6. 04 3月, 2014 2 次提交
    • F
      memcg: reparent charges of children before processing parent · 4fb1a86f
      Filipe Brandenburger 提交于
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline parent can get run before workitem to offline child;
      parent's mem_cgroup_reparent_charges() circles around waiting for the
      child's pages to be reparented to its lrus, but it's holding
      cgroup_mutex which prevents the child from reaching its
      mem_cgroup_reparent_charges().
      
      Further testing showed that an ordered workqueue for cgroup_destroy_wq
      is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
      stage on the way can mess up the order before reaching the workqueue.
      
      Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
      all its children (and grandchildren, in the correct order) to have their
      charges reparented first.
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Signed-off-by: NFilipe Brandenburger <filbranden@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4fb1a86f
    • H
      memcg: fix endless loop in __mem_cgroup_iter_next() · ce48225f
      Hugh Dickins 提交于
      Commit 0eef6156 ("memcg: fix css reference leak and endless loop in
      mem_cgroup_iter") got the interaction with the commit a few before it
      d8ad3055 ("mm/memcg: iteration skip memcgs not yet fully
      initialized") slightly wrong, and we didn't notice at the time.
      
      It's elusive, and harder to get than the original, but for a couple of
      days before rc1, I several times saw a endless loop similar to that
      supposedly being fixed.
      
      This time it was a tighter loop in __mem_cgroup_iter_next(): because we
      can get here when our root has already been offlined, and the ordering
      of conditions was such that we then just cycled around forever.
      
      Fixes: 0eef6156 ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce48225f
  7. 26 2月, 2014 1 次提交
    • M
      memcg: change oom_info_lock to mutex · 08088cb9
      Michal Hocko 提交于
      Kirill has reported the following:
      
        Task in /test killed as a result of limit of /test
        memory: usage 10240kB, limit 10240kB, failcnt 51
        memory+swap: usage 10240kB, limit 10240kB, failcnt 0
        kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
        Memory cgroup stats for /test:
      
        BUG: sleeping function called from invalid context at kernel/cpu.c:68
        in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
        2 locks held by memcg_test/66:
         #0:  (memcg_oom_lock#2){+.+...}, at: [<ffffffff81131014>] pagefault_out_of_memory+0x14/0x90
         #1:  (oom_info_lock){+.+...}, at: [<ffffffff81197b2a>] mem_cgroup_print_oom_info+0x2a/0x390
        CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
        Call Trace:
          __might_sleep+0x16a/0x210
          get_online_cpus+0x1c/0x60
          mem_cgroup_read_stat+0x27/0xb0
          mem_cgroup_print_oom_info+0x260/0x390
          dump_header+0x88/0x251
          ? trace_hardirqs_on+0xd/0x10
          oom_kill_process+0x258/0x3d0
          mem_cgroup_oom_synchronize+0x656/0x6c0
          ? mem_cgroup_charge_common+0xd0/0xd0
          pagefault_out_of_memory+0x14/0x90
          mm_fault_error+0x91/0x189
          __do_page_fault+0x48e/0x580
          do_page_fault+0xe/0x10
          page_fault+0x22/0x30
      
      which complains that mem_cgroup_read_stat cannot be called from an atomic
      context but mem_cgroup_print_oom_info takes a spinlock.  Change
      oom_info_lock to a mutex.
      
      This was introduced by 947b3dd1 ("memcg, oom: lock
      mem_cgroup_print_oom_info").
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: N"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08088cb9
  8. 13 2月, 2014 1 次提交
    • T
      cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count() · 07bc356e
      Tejun Heo 提交于
      cgroup_task_count() read-locks css_set_lock and walks all tasks to
      count them and then returns the result.  The only thing all the users
      want is determining whether the cgroup is empty or not.  This patch
      implements cgroup_has_tasks() which tests whether cgroup->cset_links
      is empty, replaces all cgroup_task_count() usages and unexports it.
      
      Note that the test isn't synchronized.  This is the same as before.
      The test has always been racy.
      
      This will help planned css_set locking update.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      07bc356e
  9. 12 2月, 2014 2 次提交
    • T
      cgroup: remove cgroup->name · e61734c5
      Tejun Heo 提交于
      cgroup->name handling became quite complicated over time involving
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      e61734c5
    • T
      cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Tejun Heo 提交于
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to be holding RCU read lock or
      cgroup_mutex and handling pinning on the caller side.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      5a17f543
  10. 08 2月, 2014 1 次提交
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo 提交于
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NIngo Molnar <mingo@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
  11. 31 1月, 2014 1 次提交
  12. 24 1月, 2014 11 次提交
    • V
      memcg: remove unused code from kmem_cache_destroy_work_func · 0d8a4a37
      Vladimir Davydov 提交于
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d8a4a37
    • M
      memcg: fix css reference leak and endless loop in mem_cgroup_iter · 0eef6156
      Michal Hocko 提交于
      Commit 19f39402 ("memcg: simplify mem_cgroup_iter") has reorganized
      mem_cgroup_iter code in order to simplify it.  A part of that change was
      dropping an optimization which didn't call css_tryget on the root of the
      walked tree.  The patch however didn't change the css_put part in
      mem_cgroup_iter which excludes root.
      
      This wasn't an issue at the time because __mem_cgroup_iter_next bailed
      out for root early without taking a reference as cgroup iterators
      (css_next_descendant_pre) didn't visit root themselves.
      
      Nevertheless cgroup iterators have been reworked to visit root by commit
      bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration") when the root bypass have been
      dropped in __mem_cgroup_iter_next.  This means that css_put is not
      called for root and so css along with mem_cgroup and other cgroup
      internal object tied by css lifetime are never freed.
      
      Fix the issue by reintroducing root check in __mem_cgroup_iter_next and
      do not take css reference for it.
      
      This reference counting magic protects us also from another issue, an
      endless loop reported by Hugh Dickins when reclaim races with root
      removal and css_tryget called by iterator internally would fail.  There
      would be no other nodes to visit so __mem_cgroup_iter_next would return
      NULL and mem_cgroup_iter would interpret it as "start looping from root
      again" and so mem_cgroup_iter would loop forever internally.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NHugh Dickins <hughd@google.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0eef6156
    • M
      memcg: fix endless loop caused by mem_cgroup_iter · ecc736fc
      Michal Hocko 提交于
      Hugh has reported an endless loop when the hardlimit reclaim sees the
      same group all the time.  This might happen when the reclaim races with
      the memcg removal.
      
      shrink_zone
                                                      [rmdir root]
        mem_cgroup_iter(root, NULL, reclaim)
          // prev = NULL
          rcu_read_lock()
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root || NULL
            css_tryget(last_visited)            // failed
            last_visited = NULL                 [1]
          memcg = root = __mem_cgroup_iter_next(root, NULL)
          mem_cgroup_iter_update
            iter->last_visited = root;
          reclaim->generation = iter->generation
      
       mem_cgroup_iter(root, root, reclaim)
         // prev = root
         rcu_read_lock
          mem_cgroup_iter_load
            last_visited = iter->last_visited   // gets root
            css_tryget(last_visited)            // failed
          [1]
      
      The issue seemed to be introduced by commit 5f578161 ("memcg: relax
      memcg iter caching") which has replaced unconditional css_get/css_put by
      css_tryget/css_put for the cached iterator.
      
      This patch fixes the issue by skipping css_tryget on the root of the
      tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
      in mem_cgroup_iter_update.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NHugh Dickins <hughd@google.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecc736fc
    • D
      mm, oom: prefer thread group leaders for display purposes · d49ad935
      David Rientjes 提交于
      When two threads have the same badness score, it's preferable to kill
      the thread group leader so that the actual process name is printed to
      the kernel log rather than the thread group name which may be shared
      amongst several processes.
      
      This was the behavior when select_bad_process() used to do
      for_each_process(), but it now iterates threads instead and leads to
      ambiguity.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d49ad935
    • H
      mm/memcg: iteration skip memcgs not yet fully initialized · d8ad3055
      Hugh Dickins 提交于
      It is surprising that the mem_cgroup iterator can return memcgs which
      have not yet been fully initialized.  By accident (or trial and error?)
      this appears not to present an actual problem; but it may be better to
      prevent such surprises, by skipping memcgs not yet online.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8ad3055
    • H
      mm/memcg: fix last_dead_count memory wastage · d2ab70aa
      Hugh Dickins 提交于
      Shorten mem_cgroup_reclaim_iter.last_dead_count from unsigned long to
      int: it's assigned from an int and compared with an int, and adjacent to
      an unsigned int: so there's no point to it being unsigned long, which
      wasted 104 bytes in every mem_cgroup_per_zone.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2ab70aa
    • V
      memcg: rework memcg_update_kmem_limit synchronization · d6441637
      Vladimir Davydov 提交于
      Currently we take both the memcg_create_mutex and the set_limit_mutex
      when we enable kmem accounting for a memory cgroup, which makes kmem
      activation events serialize with both memcg creations and other memcg
      limit updates (memory.limit, memory.memsw.limit).  However, there is no
      point in such strict synchronization rules there.
      
      First, the set_limit_mutex was introduced to keep the memory.limit and
      memory.memsw.limit values in sync.  Since memory.kmem.limit can be set
      independently of them, it is better to introduce a separate mutex to
      synchronize against concurrent kmem limit updates.
      
      Second, we take the memcg_create_mutex in order to make sure all
      children of this memcg will be kmem-active as well.  For achieving that,
      it is enough to hold this mutex only while checking if
      memcg_has_children() though.  This guarantees that if a child is added
      after we checked that the memcg has no children, the newly added cgroup
      will see its parent kmem-active (of course if the latter succeeded), and
      call kmem activation for itself.
      
      This patch simplifies the locking rules of memcg_update_kmem_limit()
      according to these considerations.
      
      [vdavydov@parallels.com: fix unintialized var warning]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6441637
    • V
      memcg: remove KMEM_ACCOUNTED_ACTIVATED flag · 6de64beb
      Vladimir Davydov 提交于
      Currently we have two state bits in mem_cgroup::kmem_account_flags
      regarding kmem accounting activation, ACTIVATED and ACTIVE.  We start
      kmem accounting only if both flags are set (memcg_can_account_kmem()),
      plus throughout the code there are several places where we check only
      the ACTIVE flag, but we never check the ACTIVATED flag alone.  These
      flags are both set from memcg_update_kmem_limit() under the
      set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
      they never get cleared.  That said checking if both flags are set is
      equivalent to checking only for the ACTIVE flag, and since there is no
      ACTIVATED flag checks, we can safely remove the ACTIVATED flag, and
      nothing will change.
      
      Let's try to understand what was the reason for introducing these flags.
      The purpose of the ACTIVE flag is clear - it states that kmem should be
      accounting to the cgroup.  The only requirement for it is that it should
      be set after we have fully initialized kmem accounting bits for the
      cgroup and patched all static branches relating to kmem accounting.
      Since we always check if static branch is enabled before actually
      considering if we should account (otherwise we wouldn't benefit from
      static branching), this guarantees us that we won't skip a commit or
      uncharge after a charge due to an unpatched static branch.
      
      Now let's move on to the ACTIVATED bit.  As I proved in the beginning of
      this message, it is absolutely useless, and removing it will change
      nothing.  So what was the reason introducing it?
      
      The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use
      static branches when code not in use") in order to guarantee that
      static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
      for each memory cgroup when its kmem accounting was activated.  The
      point was that at that time the memcg_update_kmem_limit() function's
      work-flow looked like this:
      
              bool must_inc_static_branch = false;
      
              cgroup_lock();
              mutex_lock(&set_limit_mutex);
              if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                      /* The kmem limit is set for the first time */
                      ret = res_counter_set_limit(&memcg->kmem, val);
      
                      memcg_kmem_set_activated(memcg);
                      must_inc_static_branch = true;
              } else
                      ret = res_counter_set_limit(&memcg->kmem, val);
              mutex_unlock(&set_limit_mutex);
              cgroup_unlock();
      
              if (must_inc_static_branch) {
                      /* We can't do this under cgroup_lock */
                      static_key_slow_inc(&memcg_kmem_enabled_key);
                      memcg_kmem_set_active(memcg);
              }
      
      So that without the ACTIVATED flag we could race with other threads
      trying to set the limit and increment the static branching ref-counter
      more than once.  Today we call the whole memcg_update_kmem_limit()
      function under the set_limit_mutex and this race is impossible.
      
      As now we understand why the ACTIVATED bit was introduced and why we
      don't need it now, and know that removing it will change nothing anyway,
      let's get rid of it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6de64beb
    • V
      memcg, slab: RCU protect memcg_params for root caches · f8570263
      Vladimir Davydov 提交于
      We relocate root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8570263
    • V
      memcg: get rid of kmem_cache_dup() · 842e2873
      Vladimir Davydov 提交于
      kmem_cache_dup() is only called from memcg_create_kmem_cache().  The
      latter, in fact, does nothing besides this, so let's fold
      kmem_cache_dup() into memcg_create_kmem_cache().
      
      This patch also makes the memcg_cache_mutex private to
      memcg_create_kmem_cache(), because it is not used anywhere else.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      842e2873
    • V
      memcg, slab: fix races in per-memcg cache creation/destruction · 2edefe11
      Vladimir Davydov 提交于
      We obtain a per-memcg cache from a root kmem_cache by dereferencing an
      entry of the root cache's memcg_params::memcg_caches array.  If we find
      no cache for a memcg there on allocation, we initiate the memcg cache
      creation (see memcg_kmem_get_cache()).  The cache creation proceeds
      asynchronously in memcg_create_kmem_cache() in order to avoid lock
      clashes, so there can be several threads trying to create the same
      kmem_cache concurrently, but only one of them may succeed.  However, due
      to a race in the code, it is not always true.  The point is that the
      memcg_caches array can be relocated when we activate kmem accounting for
      a memcg (see memcg_update_all_caches(), memcg_update_cache_size()).  If
      memcg_update_cache_size() and memcg_create_kmem_cache() proceed
      concurrently as described below, we can leak a kmem_cache.
      
      Asume two threads schedule creation of the same kmem_cache.  One of them
      successfully creates it.  Another one should fail then, but if
      memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
      follows, it won't:
      
        memcg_create_kmem_cache()             memcg_update_cache_size()
        (called w/o mutexes held)             (called with slab_mutex,
                                               set_limit_mutex held)
        -------------------------             -------------------------
      
        mutex_lock(&memcg_cache_mutex)
      
                                              s->memcg_params=kzalloc(...)
      
        new_cachep=cache_from_memcg_idx(cachep,idx)
        // new_cachep==NULL => proceed to creation
      
                                              s->memcg_params->memcg_caches[i]
                                                  =cur_params->memcg_caches[i]
      
        // kmem_cache_create_memcg takes slab_mutex
        // so we will hang around until
        // memcg_update_cache_size finishes, but
        // nothing will prevent it from succeeding so
        // memcg_caches[idx] will be overwritten in
        // memcg_register_cache!
      
        new_cachep = kmem_cache_create_memcg(...)
        mutex_unlock(&memcg_cache_mutex)
      
      Let's fix this by moving the check for existence of the memcg cache to
      kmem_cache_create_memcg() to be called under the slab_mutex and make it
      return NULL if so.
      
      A similar race is possible when destroying a memcg cache (see
      kmem_cache_destroy()).  Since memcg_unregister_cache(), which clears the
      pointer in the memcg_caches array, is called w/o protection, we can race
      with memcg_update_cache_size() and omit clearing the pointer.  Therefore
      memcg_unregister_cache() should be moved before we release the
      slab_mutex.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2edefe11