1. 16 3月, 2012 1 次提交
    • H
      memcg: free mem_cgroup by RCU to fix oops · 59927fb9
      Hugh Dickins 提交于
      After fixing the GPF in mem_cgroup_lru_del_list(), three times one
      machine running a similar load (moving and removing memcgs while
      swapping) has oopsed in mem_cgroup_zone_nr_lru_pages(), when retrieving
      memcg zone numbers for get_scan_count() for shrink_mem_cgroup_zone():
      this is where a struct mem_cgroup is first accessed after being chosen
      by mem_cgroup_iter().
      
      Just what protects a struct mem_cgroup from being freed, in between
      mem_cgroup_iter()'s css_get_next() and its css_tryget()? css_tryget()
      fails once css->refcnt is zero with CSS_REMOVED set in flags, yes: but
      what if that memory is freed and reused for something else, which sets
      "refcnt" non-zero? Hmm, and scope for an indefinite freeze if refcnt is
      left at zero but flags are cleared.
      
      It's tempting to move the css_tryget() into css_get_next(), to make it
      really "get" the css, but I don't think that actually solves anything:
      the same difficulty in moving from css_id found to stable css remains.
      
      But we already have rcu_read_lock() around the two, so it's easily fixed
      if __mem_cgroup_free() just uses kfree_rcu() to free mem_cgroup.
      
      However, a big struct mem_cgroup is allocated with vzalloc() instead of
      kzalloc(), and we're not allowed to vfree() at interrupt time: there
      doesn't appear to be a general vfree_rcu() to help with this, so roll
      our own using schedule_work().  The compiler decently removes
      vfree_work() and vfree_rcu() when the config doesn't need them.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59927fb9
  2. 10 3月, 2012 1 次提交
  3. 06 3月, 2012 3 次提交
    • N
      memcg: fix mapcount check in move charge code for anonymous page · e6ca7b89
      Naoya Horiguchi 提交于
      Currently the charge on shared anonyous pages is supposed not to moved in
      task migration.  To implement this, we need to check that mapcount > 1,
      instread of > 2.  So this patch fixes it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6ca7b89
    • H
      memcg: fix GPF when cgroup removal races with last exit · 7512102c
      Hugh Dickins 提交于
      When moving tasks from old memcg (with move_charge_at_immigrate on new
      memcg), followed by removal of old memcg, hit General Protection Fault in
      mem_cgroup_lru_del_list() (called from release_pages called from
      free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from
      exit_mmap from mmput from exit_mm from do_exit).
      
      Somewhat reproducible, takes a few hours: the old struct mem_cgroup has
      been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is
      still trying to update its stats, and take page off lru before freeing.
      
      A task, or a charge, or a page on lru: each secures a memcg against
      removal.  In this case, the last task has been moved out of the old memcg,
      and it is exiting: anonymous pages are uncharged one by one from the
      memcg, as they are zapped from its pagetables, so the charge gets down to
      0; but the pages themselves are queued in an mmu_gather for freeing.
      
      Most of those pages will be on lru (and force_empty is careful to
      lru_add_drain_all, to add pages from pagevec to lru first), but not
      necessarily all: perhaps some have been isolated for page reclaim, perhaps
      some isolated for other reasons.  So, force_empty may find no task, no
      charge and no page on lru, and let the removal proceed.
      
      There would still be no problem if these pages were immediately freed; but
      typically (and the put_page_testzero protocol demands it) they have to be
      added back to lru before they are found freeable, then removed from lru
      and freed.  We don't see the issue when adding, because the
      mem_cgroup_iter() loops keep their own reference to the memcg being
      scanned; but when it comes to mem_cgroup_lru_del_list().
      
      I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and
      PageCgroupUsed flags were used (like a trick with mirrors) to deflect view
      of pc->mem_cgroup to the stable root_mem_cgroup when neither set.
      38c5d72f ("memcg: simplify LRU handling by new rule") mercifully
      removed those convolutions, but left this General Protection Fault.
      
      But it's surprisingly easy to restore the old behaviour: just check
      PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec
      to add), and reset pc to root_mem_cgroup if page is uncharged.  A risky
      change?  just going back to how it worked before; testing, and an audit of
      uses of pc->mem_cgroup, show no problem.
      
      And there's a nice bonus: with mem_cgroup_lru_add_list() itself making
      sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no
      longer has any purpose, and we can safely revert 4e5f01c2 ("memcg:
      clear pc->mem_cgroup if necessary").
      
      Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c
      is not strictly necessary: the lru_lock there, with RCU before memcg
      structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe
      without that; but it seems cleaner to rely on one dependency less.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7512102c
    • H
      memcg: fix deadlock by inverting lrucare nesting · 9ce70c02
      Hugh Dickins 提交于
      We have forgotten the rules of lock nesting: the irq-safe ones must be
      taken inside the non-irq-safe ones, otherwise we are open to deadlock:
      
      CPU0                          CPU1
      ----                          ----
      lock(&(&pc->lock)->rlock);
                                    local_irq_disable();
                                    lock(&(&zone->lru_lock)->rlock);
                                    lock(&(&pc->lock)->rlock);
      <Interrupt>
      lock(&(&zone->lru_lock)->rlock);
      
      To check a different locking issue, I happened to add a spin_lock to
      memcg's bit_spin_lock in lock_page_cgroup(), and lockdep very quickly
      complained about __mem_cgroup_commit_charge_lrucare() (on CPU1 above).
      
      So delete __mem_cgroup_commit_charge_lrucare(), passing a bool lrucare to
      __mem_cgroup_commit_charge() instead, taking zone->lru_lock under
      lock_page_cgroup() in the lrucare case.
      
      The original was using spin_lock_irqsave, but we'd be in more trouble if
      it were ever called at interrupt time: unconditional _irq is enough.  And
      ClearPageLRU before del from lru, SetPageLRU before add to lru: no strong
      reason, but that is the ordering used consistently elsewhere.
      
      Fixes 36b62ad5 ("memcg: simplify corner case handling
      of LRU").
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ce70c02
  4. 25 2月, 2012 1 次提交
    • A
      mm: memcg: Correct unregistring of events attached to the same eventfd · 371528ca
      Anton Vorontsov 提交于
      There is an issue when memcg unregisters events that were attached to
      the same eventfd:
      
      - On the first call mem_cgroup_usage_unregister_event() removes all
        events attached to a given eventfd, and if there were no events left,
        thresholds->primary would become NULL;
      
      - Since there were several events registered, cgroups core will call
        mem_cgroup_usage_unregister_event() again, but now kernel will oops,
        as the function doesn't expect that threshold->primary may be NULL.
      
      That's a good question whether mem_cgroup_usage_unregister_event()
      should actually remove all events in one go, but nowadays it can't
      do any better as cftype->unregister_event callback doesn't pass
      any private event-associated cookie. So, let's fix the issue by
      simply checking for threshold->primary.
      
      FWIW, w/o the patch the following oops may be observed:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
       IP: [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
       Pid: 574, comm: kworker/0:2 Not tainted 3.3.0-rc4+ #9 Bochs Bochs
       RIP: 0010:[<ffffffff810be32c>]  [<ffffffff810be32c>] mem_cgroup_usage_unregister_event+0x9c/0x1f0
       RSP: 0018:ffff88001d0b9d60  EFLAGS: 00010246
       Process kworker/0:2 (pid: 574, threadinfo ffff88001d0b8000, task ffff88001de91cc0)
       Call Trace:
        [<ffffffff8107092b>] cgroup_event_remove+0x2b/0x60
        [<ffffffff8103db94>] process_one_work+0x174/0x450
        [<ffffffff8103e413>] worker_thread+0x123/0x2d0
      
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      371528ca
  5. 04 2月, 2012 1 次提交
  6. 24 1月, 2012 1 次提交
  7. 23 1月, 2012 1 次提交
  8. 17 1月, 2012 1 次提交
  9. 13 1月, 2012 24 次提交
  10. 08 1月, 2012 1 次提交
    • G
      net: fix sock_clone reference mismatch with tcp memcontrol · f3f511e1
      Glauber Costa 提交于
      Sockets can also be created through sock_clone. Because it copies
      all data in the sock structure, it also copies the memcg-related pointer,
      and all should be fine. However, since we now use reference counts in
      socket creation, we are left with some sockets that have no reference
      counts. It matters when we destroy them, since it leads to a mismatch.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      CC: David S. Miller <davem@davemloft.net>
      CC: Greg Thelen <gthelen@google.com>
      CC: Hiroyouki Kamezawa <kamezawa.hiroyu@jp.fujitsu.com>
      CC: Laurent Chavey <chavey@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f3f511e1
  11. 23 12月, 2011 1 次提交
    • G
      Partial revert "Basic kernel memory functionality for the Memory Controller" · 65c64ce8
      Glauber Costa 提交于
      This reverts commit e5671dfa.
      
      After a follow up discussion with Michal, it was agreed it would
      be better to leave the kmem controller with just the tcp files,
      deferring the behavior of the other general memory.kmem.* files
      for a later time, when more caches are controlled. This is because
      generic kmem files are not used by tcp accounting and it is
      not clear how other slab caches would fit into the scheme.
      
      We are reverting the original commit so we can track the reference.
      Part of the patch is kept, because it was used by the later tcp
      code. Conflicts are shown in the bottom. init/Kconfig is removed from
      the revert entirely.
      Signed-off-by: NGlauber Costa <glommer@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      CC: Kirill A. Shutemov <kirill@shutemov.name>
      CC: Paul Menage <paul@paulmenage.org>
      CC: Greg Thelen <gthelen@google.com>
      CC: Johannes Weiner <jweiner@redhat.com>
      CC: David S. Miller <davem@davemloft.net>
      
      Conflicts:
      
      	Documentation/cgroups/memory.txt
      	mm/memcontrol.c
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      65c64ce8
  12. 21 12月, 2011 1 次提交
  13. 13 12月, 2011 3 次提交