1. 10 7月, 2013 6 次提交
    • L
      memcg: use css_get() in sock_update_memcg() · 5347e5ae
      Li Zefan 提交于
      Use css_get/css_put instead of mem_cgroup_get/put.
      
      Note, if at the same time someone is moving @current to a different
      cgroup and removing the old cgroup, css_tryget() may return false, and
      sock->sk_cgrp won't be initialized, which is fine.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5347e5ae
    • M
      memcg, kmem: fix reference count handling on the error path · f37a9691
      Michal Hocko 提交于
      mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails.
      This is not correct because only memcg_propagate_kmem takes an
      additional reference while mem_cgroup_sockets_init is allowed to fail as
      well (although no current implementation fails) but it doesn't take any
      reference.  This all suggests that it should be memcg_propagate_kmem
      that should clean up after itself so this patch moves mem_cgroup_put
      over there.
      
      Unfortunately this is not that easy (as pointed out by Li Zefan) because
      memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is
      marked active (KMEM_ACCOUNTED_ACTIVE) which is the case even if
      memcg_propagate_kmem fails so the additional reference is dropped in
      that case in kmem_cgroup_destroy which means that the reference would be
      dropped two times.
      
      The easiest way then would be to simply remove mem_cgrroup_put from
      mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right
      thing.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.8]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f37a9691
    • M
      Revert "memcg: avoid dangling reference count in creation failure" · fa460c2d
      Michal Hocko 提交于
      This reverts commit e4715f01.
      
      mem_cgroup_put is hierarchy aware so mem_cgroup_put(memcg) already drops
      an additional reference from all parents so the additional
      mem_cgrroup_put(parent) potentially causes use-after-free.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa460c2d
    • G
      memcg: do not account memory used for cache creation · 425c598d
      Glauber Costa 提交于
      The memory we used to hold the memcg arrays is currently accounted to
      the current memcg.  But that creates a problem, because that memory can
      only be freed after the last user is gone.  Our only way to know which
      is the last user, is to hook up to freeing time, but the fact that we
      still have some in flight kmallocs will prevent freeing to happen.  I
      believe therefore to be just easier to account this memory as global
      overhead.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      425c598d
    • G
      memcg: also test for skip accounting at the page allocation level · 6d42c232
      Glauber Costa 提交于
      The memory we used to hold the memcg arrays is currently accounted to
      the current memcg.  But that creates a problem, because that memory can
      only be freed after the last user is gone.  Our only way to know which
      is the last user, is to hook up to freeing time, but the fact that we
      still have some in flight kmallocs will prevent freeing to happen.  I
      believe therefore to be just easier to account this memory as global
      overhead.
      
      This patch (of 2):
      
      Disabling accounting is only relevant for some specific memcg internal
      allocations.  Therefore we would initially not have such check at
      memcg_kmem_newpage_charge, since direct calls to the page allocator that
      are marked with GFP_KMEMCG only happen outside memcg core.  We are
      mostly concerned with cache allocations and by having this test at
      memcg_kmem_get_cache we are already able to relay the allocation to the
      root cache and bypass the memcg caches altogether.
      
      There is one exception, though: the SLUB allocator does not create large
      order caches, but rather service large kmallocs directly from the page
      allocator.  Therefore, the following sequence, when backed by the SLUB
      allocator:
      
      	memcg_stop_kmem_account();
      	kmalloc(<large_number>)
      	memcg_resume_kmem_account();
      
      would effectively ignore the fact that we should skip accounting, since
      it will drive us directly to this function without passing through the
      cache selector memcg_kmem_get_cache.  Such large allocations are
      extremely rare but can happen, for instance, for the cache arrays.
      
      This was never a problem in practice, because we weren't skipping
      accounting for the cache arrays.  All the allocations we were skipping
      were fairly small.  However, the fact that we were not skipping those
      allocations are a problem and can prevent the memcgs from going away.
      As we fix that, we need to make sure that the fix will also work with
      the SLUB allocator.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reported-by: NMichal Hocko <mhocko@suze.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d42c232
    • J
      memcg: clean up memcg->nodeinfo · 54f72fe0
      Johannes Weiner 提交于
      Remove struct mem_cgroup_lru_info and fold its single member, the
      variably sized nodeinfo[0], directly into struct mem_cgroup.  This
      should make it more obvious why it has to be the last member there.
      
      Also move the comment that's above that special last member below it, so
      it is more visible to somebody that considers appending to the struct
      mem_cgroup.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54f72fe0
  2. 04 7月, 2013 2 次提交
  3. 13 6月, 2013 2 次提交
    • J
      mm: memcontrol: fix lockless reclaim hierarchy iterator · 89dc991f
      Johannes Weiner 提交于
      The lockless reclaim hierarchy iterator currently has a misplaced
      barrier that can lead to use-after-free crashes.
      
      The reclaim hierarchy iterator consist of a sequence count and a
      position pointer that are read and written locklessly, with memory
      barriers enforcing ordering.
      
      The write side sets the position pointer first, then updates the
      sequence count to "publish" the new position.  Likewise, the read side
      must read the sequence count first, then the position.  If the sequence
      count is up to date, it's guaranteed that the position is up to date as
      well:
      
        writer:                         reader:
        iter->position = position       if iter->sequence == expected:
        smp_wmb()                           smp_rmb()
        iter->sequence = sequence           position = iter->position
      
      However, the read side barrier is currently misplaced, which can lead to
      dereferencing stale position pointers that no longer point to valid
      memory.  Fix this.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: <stable@kernel.org>		[3.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89dc991f
    • A
      memcg: don't initialize kmem-cache destroying work for root caches · f101a946
      Andrey Vagin 提交于
      struct memcg_cache_params has a union.  Different parts of this union
      are used for root and non-root caches.  A part with destroying work is
      used only for non-root caches.
      
        BUG: unable to handle kernel paging request at 0000000fffffffe0
        IP: kmem_cache_alloc+0x41/0x1f0
        Modules linked in: netlink_diag af_packet_diag udp_diag tcp_diag inet_diag unix_diag ip6table_filter ip6_tables i2c_piix4 virtio_net virtio_balloon microcode i2c_core pcspkr floppy
        CPU: 0 PID: 1929 Comm: lt-vzctl Tainted: G      D      3.10.0-rc1+ #2
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        RIP: kmem_cache_alloc+0x41/0x1f0
        Call Trace:
         getname_flags.part.34+0x30/0x140
         getname+0x38/0x60
         do_sys_open+0xc5/0x1e0
         SyS_open+0x22/0x30
         system_call_fastpath+0x16/0x1b
        Code: f4 53 48 83 ec 18 8b 05 8e 53 b7 00 4c 8b 4d 08 21 f0 a8 10 74 0d 4c 89 4d c0 e8 1b 76 4a 00 4c 8b 4d c0 e9 92 00 00 00 4d 89 f5 <4d> 8b 45 00 65 4c 03 04 25 48 cd 00 00 49 8b 50 08 4d 8b 38 49
        RIP  [<ffffffff8116b641>] kmem_cache_alloc+0x41/0x1f0
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: <stable@vger.kernel.org>	[3.9.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f101a946
  4. 25 5月, 2013 1 次提交
  5. 08 5月, 2013 1 次提交
  6. 30 4月, 2013 11 次提交
    • L
      memcg: take reference before releasing rcu_read_lock · ca0dde97
      Li Zefan 提交于
      The memcg is not referenced, so it can be destroyed at anytime right
      after we exit rcu read section, so it's not safe to access it.
      
      To fix this, we call css_tryget() to get a reference while we're still
      in rcu read section.
      
      This also removes a bogus comment above __memcg_create_cache_enqueue().
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca0dde97
    • D
      mm, memcg: give exiting processes access to memory reserves · 465adcf1
      David Rientjes 提交于
      A memcg may livelock when oom if the process that grabs the hierarchy's
      oom lock is never the first process with PF_EXITING set in the memcg's
      task iteration.
      
      The oom killer, both global and memcg, will defer if it finds an
      eligible process that is in the process of exiting and it is not being
      ptraced.  The idea is to allow it to exit without using memory reserves
      before needlessly killing another process.
      
      This normally works fine except in the memcg case with a large number of
      threads attached to the oom memcg.  In this case, the memcg oom killer
      only gets called for the process that grabs the hierarchy's oom lock;
      all others end up blocked on the memcg's oom waitqueue.  Thus, if the
      process that grabs the hierarchy's oom lock is never the first
      PF_EXITING process in the memcg's task iteration, the oom killer is
      constantly deferred without anything making progress.
      
      The fix is to give PF_EXITING processes access to memory reserves so
      that we've marked them as oom killed without any iteration.  This allows
      __mem_cgroup_try_charge() to succeed so that the process may exit.  This
      makes the memcg oom killer exemption for TIF_MEMDIE tasks, now
      immediately granted for processes with pending SIGKILLs and those in the
      exit path, to be equivalent to what is done for the global oom killer.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      465adcf1
    • L
      memcg: avoid accessing memcg after releasing reference · fd0ccaf2
      Li Zefan 提交于
      This might cause a use-after-free bug.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd0ccaf2
    • A
      memcg: add memory.pressure_level events · 70ddf637
      Anton Vorontsov 提交于
      With this patch userland applications that want to maintain the
      interactivity/memory allocation cost can use the pressure level
      notifications.  The levels are defined like this:
      
      The "low" level means that the system is reclaiming memory for new
      allocations.  Monitoring this reclaiming activity might be useful for
      maintaining cache level.  Upon notification, the program (typically
      "Activity Manager") might analyze vmstat and act in advance (i.e.
      prematurely shutdown unimportant services).
      
      The "medium" level means that the system is experiencing medium memory
      pressure, the system might be making swap, paging out active file
      caches, etc.  Upon this event applications may decide to further analyze
      vmstat/zoneinfo/memcg or internal memory usage statistics and free any
      resources that can be easily reconstructed or re-read from a disk.
      
      The "critical" level means that the system is actively thrashing, it is
      about to out of memory (OOM) or even the in-kernel OOM killer is on its
      way to trigger.  Applications should do whatever they can to help the
      system.  It might be too late to consult with vmstat or any other
      statistics, so it's advisable to take an immediate action.
      
      The events are propagated upward until the event is handled, i.e.  the
      events are not pass-through.  Here is what this means: for example you
      have three cgroups: A->B->C.  Now you set up an event listener on
      cgroups A, B and C, and suppose group C experiences some pressure.  In
      this situation, only group C will receive the notification, i.e.  groups
      A and B will not receive it.  This is done to avoid excessive
      "broadcasting" of messages, which disturbs the system and which is
      especially bad if we are low on memory or thrashing.  So, organize the
      cgroups wisely, or propagate the events manually (or, ask us to
      implement the pass-through events, explaining why would you need them.)
      
      Performance wise, the memory pressure notifications feature itself is
      lightweight and does not require much of bookkeeping, in contrast to the
      rest of memcg features.  Unfortunately, as of current memcg
      implementation, pages accounting is an inseparable part and cannot be
      turned off.  The good news is that there are some efforts[1] to improve
      the situation; plus, implementing the same, fully API-compatible[2]
      interface for CONFIG_MEMCG=n case (e.g.  embedded) is also a viable
      option, so it will not require any changes on the userland side.
      
      [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291
      [2] http://lkml.org/lkml/2013/2/21/454
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings]
      Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ddf637
    • M
      mm/memcontrol.c: remove unnecessary ; · 573b400d
      Michel Lespinasse 提交于
      Just a trivial issue I stumbled on while doing something else...
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      573b400d
    • M
      memcg: do not check for do_swap_account in mem_cgroup_{read,write,reset} · acb6d558
      Michal Hocko 提交于
      Since commit 2d11085e ("memcg: do not create memsw files if swap
      accounting is disabled") memsw files are created only if memcg swap
      accounting is enabled so it doesn't make any sense to check for it
      explicitly in mem_cgroup_read(), mem_cgroup_write() and
      mem_cgroup_reset().
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      acb6d558
    • M
      memcg: further simplify mem_cgroup_iter · 16248d8f
      Michal Hocko 提交于
      mem_cgroup_iter basically does two things currently.  It takes care of
      the house keeping (reference counting, raclaim cookie) and it iterates
      through a hierarchy tree (by using cgroup generic tree walk).  The code
      would be much more easier to follow if we move the iteration outside of
      the function (to __mem_cgrou_iter_next) so the distinction is more
      clear.  This patch doesn't introduce any functional changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16248d8f
    • M
      memcg: simplify mem_cgroup_iter · 19f39402
      Michal Hocko 提交于
      The current implementation of mem_cgroup_iter has to consider both css
      and memcg to find out whether no group has been found (css==NULL - aka
      the loop is completed) and that no memcg is associated with the found
      node (!memcg - aka css_tryget failed because the group is no longer
      alive).  This leads to awkward tweaks like tests for css && !memcg to
      skip the current node.
      
      It will be much easier if we got rid off css variable altogether and
      only rely on memcg.  In order to do that the iteration part has to skip
      dead nodes.  This sounds natural to me and as a nice side effect we will
      get a simple invariant that memcg is always alive when non-NULL and all
      nodes have been visited otherwise.
      
      We could get rid of the surrounding while loop but keep it in for now to
      make review easier.  It will go away in the following patch.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19f39402
    • M
      memcg: relax memcg iter caching · 5f578161
      Michal Hocko 提交于
      Now that the per-node-zone-priority iterator caches memory cgroups
      rather than their css ids we have to be careful and remove them from the
      iterator when they are on the way out otherwise they might live for
      unbounded amount of time even though their group is already gone (until
      the global/targeted reclaim triggers the zone under priority to find out
      the group is dead and let it to find the final rest).
      
      We can fix this issue by relaxing rules for the last_visited memcg.
      Instead of taking a reference to the css before it is stored into
      iter->last_visited we can just store its pointer and track the number of
      removed groups from each memcg's subhierarchy.
      
      This number would be stored into iterator everytime when a memcg is
      cached.  If the iter count doesn't match the curent walker root's one we
      will start from the root again.  The group counter is incremented
      upwards the hierarchy every time a group is removed.
      
      The iter_lock can be dropped because racing iterators cannot leak the
      reference anymore as the reference count is not elevated for
      last_visited when it is cached.
      
      Locking rules got a bit complicated by this change though.  The iterator
      primarily relies on rcu read lock which makes sure that once we see a
      valid last_visited pointer then it will be valid for the whole RCU walk.
      smp_rmb makes sure that dead_count is read before last_visited and
      last_dead_count while smp_wmb makes sure that last_visited is updated
      before last_dead_count so the up-to-date last_dead_count cannot point to
      an outdated last_visited.  css_tryget then makes sure that the
      last_visited is still alive in case the iteration races with the cached
      group removal (css is invalidated before mem_cgroup_css_offline
      increments dead_count).
      
      In short:
      mem_cgroup_iter
       rcu_read_lock()
       dead_count = atomic_read(parent->dead_count)
       smp_rmb()
       if (dead_count != iter->last_dead_count)
       	last_visited POSSIBLY INVALID -> last_visited = NULL
       if (!css_tryget(iter->last_visited))
       	last_visited DEAD -> last_visited = NULL
       next = find_next(last_visited)
       css_tryget(next)
       css_put(last_visited) 	// css would be invalidated and parent->dead_count
       			// incremented if this was the last reference
       iter->last_visited = next
       smp_wmb()
       iter->last_dead_count = dead_count
       rcu_read_unlock()
      
      cgroup_rmdir
       cgroup_destroy_locked
        atomic_add(CSS_DEACT_BIAS, &css->refcnt) // subsequent css_tryget fail
         mem_cgroup_css_offline
          mem_cgroup_invalidate_reclaim_iterators
           while(parent = parent_mem_cgroup)
           	atomic_inc(parent->dead_count)
        css_put(css) // last reference held by cgroup core
      
      Spotted by Ying Han.
      
      Original idea from Johannes Weiner.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f578161
    • M
      memcg: rework mem_cgroup_iter to use cgroup iterators · 542f85f9
      Michal Hocko 提交于
      mem_cgroup_iter curently relies on css->id when walking down a group
      hierarchy tree.  This is really awkward because the tree walk depends on
      the groups creation ordering.  The only guarantee is that a parent node is
      visited before its children.
      
      Example:
      
       1) mkdir -p a a/d a/b/c
       2) mkdir -a a/b/c a/d
      
      Will create the same trees but the tree walks will be different:
      
       1) a, d, b, c
       2) a, b, c, d
      
      Commit 574bd9f7 ("cgroup: implement generic child / descendant walk
      macros") has introduced generic cgroup tree walkers which provide either
      pre-order or post-order tree walk.  This patch converts css->id based
      iteration to pre-order tree walk to keep the semantic with the original
      iterator where parent is always visited before its subtree.
      
      cgroup_for_each_descendant_pre suggests using post_create and
      pre_destroy for proper synchronization with groups addidition resp.
      removal.  This implementation doesn't use those because a new memory
      cgroup is initialized sufficiently for iteration in mem_cgroup_css_alloc
      already and css reference counting enforces that the group is alive for
      both the last seen cgroup and the found one resp.  it signals that the
      group is dead and it should be skipped.
      
      If the reclaim cookie is used we need to store the last visited group
      into the iterator so we have to be careful that it doesn't disappear in
      the mean time.  Elevated reference count on the css keeps it alive even
      though the group have been removed (parked waiting for the last dput so
      that it can be freed).
      
      Per node-zone-prio iter_lock has been introduced to ensure that
      css_tryget and iter->last_visited is set atomically.  Otherwise two
      racing walkers could both take a references and only one release it
      leading to a css leak (which pins cgroup dentry).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      542f85f9
    • M
      memcg: keep prev's css alive for the whole mem_cgroup_iter · c40046f3
      Michal Hocko 提交于
      The patchset tries to make mem_cgroup_iter saner in the way how it walks
      hierarchies.  css->id based traversal is far from being ideal as it is not
      deterministic because it depends on the creation ordering.  Additional to
      that css_id is considered a burden for cgroup maintainers because it is
      quite some code and memcg is the last user of it.  After this series only
      the swap accounting uses css_id but that one will follow up later.
      
      Diffstat (if we exclude removed/added comments) looks quite
      promising. We got rid of some code:
      
        $ git diff mmotm... | grep -v "^[+-][[:space:]]*[/ ]\*" | diffstat
         b/include/linux/cgroup.h |    3 ---
         kernel/cgroup.c          |   33 ---------------------------------
         mm/memcontrol.c          |    4 +++-
         3 files changed, 3 insertions(+), 37 deletions(-)
      
      The first patch is just preparatory and it changes when we release css of
      the previously returned memcg.  Nothing controlversial.
      
      The second patch is the core of the patchset and it replaces css_get_next
      based on css_id by the generic cgroup pre-order.  This brings some
      chalanges for the last visited group caching during the reclaim
      (mem_cgroup_per_zone::reclaim_iter).  We have to use memcg pointers
      directly now which means that we have to keep a reference to those groups'
      css to keep them alive.
      
      I also folded iter_lock introduced by https://lkml.org/lkml/2013/1/3/295
      in the previous version into this patch.  Johannes felt the race I was
      describing should be mostly harmless and I haven't been able to trigger it
      so the lock doesn't deserve its own patch.  It is still needed
      temporarily, though, because the reference counting on iter->last_visited
      depends on it.  It will go away with the next patch.
      
      The next patch fixups an unbounded cgroup removal holdoff caused by the
      elevated css refcount.  The issue has been observed by Ying Han.  Johannes
      wasn't impressed by the previous version of the fix
      (https://lkml.org/lkml/2013/2/8/379) which cleaned up pending references
      during mem_cgroup_css_offline when a group is removed.  He has suggested a
      different way when the iterator checks whether a cached memcg is still
      valid or no.  More on that in the patch but the basic idea is that every
      memcg tracks the number removed subgroups and iterator records this number
      when a group is cached.  These numbers are checked before
      iter->last_visited is about to be used and the iteration is restarted if
      it is invalid.
      
      The fourth and fifth patches are an attempt for simplification of the
      mem_cgroup_iter.  css juggling is removed and the iteration logic is moved
      to a helper so that the reference counting and iteration are separated.
      
      The last patch just removes css_get_next as there is no user for it any
      longer.
      
      My testing looked as follows:
              A (use_hierarchy=1, limit_in_bytes=150M)
             /|\
            1 2 3
      
      Children groups were created so that the number is never higher than 3 and
      their limits were random between 50-100M.  Each group hosts a kernel build
      (starting with tar -xf so the tree is not shared and make -jNUM_CPUs/3)
      and terminated after random time - up to 5 minutes) and then it is
      removed.
      
      This should exercise both leaf and hierarchical reclaim as well as races
      with cgroup removals and debugging messages I added on top proved that.
      100 groups were created during the test.
      
      This patch:
      
      css reference counting keeps the cgroup alive even though it has been
      already removed.  mem_cgroup_iter relies on this fact and takes a
      reference to the returned group.  The reference is then released on the
      next iteration or mem_cgroup_iter_break.  mem_cgroup_iter currently
      releases the reference right after it gets the last css_id.
      
      This is correct because neither prev's memcg nor cgroup are accessed after
      then.  This will change in the next patch so we need to hold the group
      alive a bit longer so let's move the css_put at the end of the function.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c40046f3
  7. 16 4月, 2013 1 次提交
  8. 08 4月, 2013 1 次提交
  9. 09 3月, 2013 1 次提交
    • K
      memcg: initialize kmem-cache destroying work earlier · 15cf17d2
      Konstantin Khlebnikov 提交于
      Fix a warning from lockdep caused by calling cancel_work_sync() for
      uninitialized struct work.  This path has been triggered by destructon
      kmem-cache hierarchy via destroying its root kmem-cache.
      
        cache ffff88003c072d80
        obj ffff88003b410000 cache ffff88003c072d80
        obj ffff88003b924000 cache ffff88003c20bd40
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        Pid: 2825, comm: insmod Tainted: G           O 3.9.0-rc1-next-20130307+ #611
        Call Trace:
          __lock_acquire+0x16a2/0x1cb0
          lock_acquire+0x8a/0x120
          flush_work+0x38/0x2a0
          __cancel_work_timer+0x89/0xf0
          cancel_work_sync+0xb/0x10
          kmem_cache_destroy_memcg_children+0x81/0xb0
          kmem_cache_destroy+0xf/0xe0
          init_module+0xcb/0x1000 [kmem_test]
          do_one_initcall+0x11a/0x170
          load_module+0x19b0/0x2320
          SyS_init_module+0xc6/0xf0
          system_call_fastpath+0x16/0x1b
      
      Example module to demonstrate:
      
        #include <linux/module.h>
        #include <linux/slab.h>
        #include <linux/mm.h>
        #include <linux/workqueue.h>
      
        int __init mod_init(void)
        {
        	int size = 256;
        	struct kmem_cache *cache;
        	void *obj;
        	struct page *page;
      
        	cache = kmem_cache_create("kmem_cache_test", size, size, 0, NULL);
        	if (!cache)
        		return -ENOMEM;
      
        	printk("cache %p\n", cache);
      
        	obj = kmem_cache_alloc(cache, GFP_KERNEL);
        	if (obj) {
        		page = virt_to_head_page(obj);
        		printk("obj %p cache %p\n", obj, page->slab_cache);
        		kmem_cache_free(cache, obj);
        	}
      
        	flush_scheduled_work();
      
        	obj = kmem_cache_alloc(cache, GFP_KERNEL);
        	if (obj) {
        		page = virt_to_head_page(obj);
        		printk("obj %p cache %p\n", obj, page->slab_cache);
        		kmem_cache_free(cache, obj);
        	}
      
        	kmem_cache_destroy(cache);
      
        	return -EBUSY;
        }
      
        module_init(mod_init);
        MODULE_LICENSE("GPL");
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15cf17d2
  10. 24 2月, 2013 14 次提交
    • H
      memcg: stop warning on memcg_propagate_kmem · 6d043990
      Hugh Dickins 提交于
      Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
      I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
      "mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
      used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d043990
    • M
      memcg: cleanup mem_cgroup_init comment · 1081312f
      Michal Hocko 提交于
      We should encourage all memcg controller initialization independent on a
      specific mem_cgroup to be done here rather than exploit css_alloc
      callback and assume that nothing happens before root cgroup is created.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1081312f
    • M
      memcg: move memcg_stock initialization to mem_cgroup_init · e4777496
      Michal Hocko 提交于
      memcg_stock are currently initialized during the root cgroup allocation
      which is OK but it pointlessly pollutes memcg allocation code with
      something that can be called when the memcg subsystem is initialized by
      mem_cgroup_init along with other controller specific parts.
      
      This patch wraps the current memcg_stock initialization code into a
      helper calls it from the controller subsystem initialization code.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4777496
    • M
      memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init · 8787a1df
      Michal Hocko 提交于
      Per-node-zone soft limit tree is currently initialized when the root
      cgroup is created which is OK but it pointlessly pollutes memcg
      allocation code with something that can be called when the memcg
      subsystem is initialized by mem_cgroup_init along with other controller
      specific parts.
      
      While we are at it let's make mem_cgroup_soft_limit_tree_init void
      because it doesn't make much sense to report memory failure because if
      we fail to allocate memory that early during the boot then we are
      screwed anyway (this saves some code).
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8787a1df
    • J
      mm: refactor inactive_file_is_low() to use get_lru_size() · e3790144
      Johannes Weiner 提交于
      An inactive file list is considered low when its active counterpart is
      bigger, regardless of whether it is a global zone LRU list or a memcg
      zone LRU list.  The only difference is in how the LRU size is assessed.
      
      get_lru_size() does the right thing for both global and memcg reclaim
      situations.
      
      Get rid of inactive_file_is_low_global() and
      mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare
      the numbers in common code.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3790144
    • G
      memcg: avoid dangling reference count in creation failure. · e4715f01
      Glauber Costa 提交于
      When use_hierarchy is enabled, we acquire an extra reference count in
      our parent during cgroup creation.  We don't release it, though, if any
      failure exist in the creation process.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Reported-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4715f01
    • G
      memcg: increment static branch right after limit set · 692e89ab
      Glauber Costa 提交于
      We were deferring the kmemcg static branch increment to a later time,
      due to a nasty dependency between the cpu_hotplug lock, taken by the
      jump label update, and the cgroup_lock.
      
      Now we no longer take the cgroup lock, and we can save ourselves the
      trouble.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      692e89ab
    • G
      memcg: replace cgroup_lock with memcg specific memcg_lock · 0999821b
      Glauber Costa 提交于
      After the preparation work done in earlier patches, the cgroup_lock can
      be trivially replaced with a memcg-specific lock.  This is an automatic
      translation at every site where the values involved were queried.
      
      The sites where values are written, however, used to be naturally called
      under cgroup_lock.  This is the case for instance in the css_online
      callback.  For those, we now need to explicitly add the memcg lock.
      
      With this, all the calls to cgroup_lock outside cgroup core are gone.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0999821b
    • G
      memcg: fast hierarchy-aware child test · b5f99b53
      Glauber Costa 提交于
      Currently, we use cgroups' provided list of children to verify if it is
      safe to proceed with any value change that is dependent on the cgroup
      being empty.
      
      This is less than ideal, because it enforces a dependency over cgroup
      core that we would be better off without.  The solution proposed here is
      to iterate over the child cgroups and if any is found that is already
      online, we bounce and return: we don't really care how many children we
      have, only if we have any.
      
      This is also made to be hierarchy aware.  IOW, cgroups with hierarchy
      disabled, while they still exist, will be considered for the purpose of
      this interface as having no children.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5f99b53
    • G
      memcg: split part of memcg creation to css_online · d142e3e6
      Glauber Costa 提交于
      This patch is a preparatory work for later locking rework to get rid of
      big cgroup lock from memory controller code.
      
      The memory controller uses some tunables to adjust its operation.  Those
      tunables are inherited from parent to children upon children
      intialization.  For most of them, the value cannot be changed after the
      parent has a new children.
      
      cgroup core splits initialization in two phases: css_alloc and css_online.
      After css_alloc, the memory allocation and basic initialization are done.
      But the new group is not yet visible anywhere, not even for cgroup core
      code.  It is only somewhere between css_alloc and css_online that it is
      inserted into the internal children lists.  Copying tunable values in
      css_alloc will lead to inconsistent values: the children will copy the old
      parent values, that can change between the copy and the moment in which
      the groups is linked to any data structure that can indicate the presence
      of children.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d142e3e6
    • G
      memcg: prevent changes to move_charge_at_immigrate during task attach · ee5e8472
      Glauber Costa 提交于
      In memcg, we use the cgroup_lock basically to synchronize against
      attaching new children to a cgroup.  We do this because we rely on
      cgroup core to provide us with this information.
      
      We need to guarantee that upon child creation, our tunables are
      consistent.  For those, the calls to cgroup_lock() all live in handlers
      like mem_cgroup_hierarchy_write(), where we change a tunable in the
      group that is hierarchy-related.  For instance, the use_hierarchy flag
      cannot be changed if the cgroup already have children.
      
      Furthermore, those values are propagated from the parent to the child
      when a new child is created.  So if we don't lock like this, we can end
      up with the following situation:
      
      A                                   B
       memcg_css_alloc()                       mem_cgroup_hierarchy_write()
       copy use hierarchy from parent          change use hierarchy in parent
       finish creation.
      
      This is mainly because during create, we are still not fully connected
      to the css tree.  So all iterators and the such that we could use, will
      fail to show that the group has children.
      
      My observation is that all of creation can proceed in parallel with
      those tasks, except value assignment.  So what this patch series does is
      to first move all value assignment that is dependent on parent values
      from css_alloc to css_online, where the iterators all work, and then we
      lock only the value assignment.  This will guarantee that parent and
      children always have consistent values.  Together with an online test,
      that can be derived from the observation that the refcount of an online
      memcg can be made to be always positive, we should be able to
      synchronize our side without the cgroup lock.
      
      This patch:
      
      Currently, we rely on the cgroup_lock() to prevent changes to
      move_charge_at_immigrate during task migration.  However, this is only
      needed because the current strategy keeps checking this value throughout
      the whole process.  Since all we need is serialization, one needs only
      to guarantee that whatever decision we made in the beginning of a
      specific migration is respected throughout the process.
      
      We can achieve this by just saving it in mc.  By doing this, no kind of
      locking is needed.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee5e8472
    • G
      memcg: reduce the size of struct memcg 244-fold. · 45cf7ebd
      Glauber Costa 提交于
      In order to maintain all the memcg bookkeeping, we need per-node
      descriptors, which will in turn contain a per-zone descriptor.
      
      Because we want to statically allocate those, this array ends up being
      very big.  Part of the reason is that we allocate something large enough
      to hold MAX_NUMNODES, the compile time constant that holds the maximum
      number of nodes we would ever consider.
      
      However, we can do better in some cases if the firmware help us.  This
      is true for modern x86 machines; coincidentally one of the architectures
      in which MAX_NUMNODES tends to be very big.
      
      By using the firmware-provided maximum number of nodes instead of
      MAX_NUMNODES, we can reduce the memory footprint of struct memcg
      considerably.  In the extreme case in which we have only one node, this
      reduces the size of the structure from ~ 64k to ~2k.  This is
      particularly important because it means that we will no longer resort to
      the vmalloc area for the struct memcg on defconfigs.  We also have
      enough room for an extra node and still be outside vmalloc.
      
      One also has to keep in mind that with the industry's ability to fit
      more processors in a die as fast as the FED prints money, a nodes = 2
      configuration is already respectably big.
      
      [akpm@linux-foundation.org: add check for invalid nid, remove inline]
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NGreg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45cf7ebd
    • M
      memcg: clean up swap accounting initialization code · 6acc8b02
      Michal Hocko 提交于
      Memcg swap accounting is currently enabled by enable_swap_cgroup when
      the root cgroup is created.  mem_cgroup_init acts as a memcg subsystem
      initializer which sounds like a much better place for enable_swap_cgroup
      as well.  We already register memsw files from there so it makes a lot
      of sense to merge those two into a single enable_swap_cgroup function.
      
      This patch doesn't introduce any semantic changes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Zhouping Liu <zliu@redhat.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: CAI Qian <caiqian@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6acc8b02
    • M
      memcg: do not create memsw files if swap accounting is disabled · 2d11085e
      Michal Hocko 提交于
      Zhouping Liu has reported that memsw files are exported even though swap
      accounting is runtime disabled if MEMCG_SWAP is enabled.  This behavior
      has been introduced by commit af36f906 ("memcg: always create memsw
      files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open
      the file to return EOPNOTSUPP.  Although EOPNOTSUPP should say be clear
      that memsw operations are not supported in the given configuration it is
      fair to say that this behavior could be quite confusing.
      
      Let's tear memsw files out of default cgroup files and add them only if
      the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or
      swapaccount=1 boot parameter).  We can hook into mem_cgroup_init which
      is called when the memcg subsystem is initialized and which happens
      after boot command line is processed.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NZhouping Liu <zliu@redhat.com>
      Tested-by: NZhouping Liu <zliu@redhat.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: CAI Qian <caiqian@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2d11085e