1. 11 12月, 2014 10 次提交
  2. 30 10月, 2014 1 次提交
    • J
      mm: memcontrol: fix missed end-writeback page accounting · d7365e78
      Johannes Weiner 提交于
      Commit 0a31bc97 ("mm: memcontrol: rewrite uncharge API") changed
      page migration to uncharge the old page right away.  The page is locked,
      unmapped, truncated, and off the LRU, but it could race with writeback
      ending, which then doesn't unaccount the page properly:
      
      test_clear_page_writeback()              migration
                                                 wait_on_page_writeback()
        TestClearPageWriteback()
                                                 mem_cgroup_migrate()
                                                   clear PCG_USED
        mem_cgroup_update_page_stat()
          if (PageCgroupUsed(pc))
            decrease memcg pages under writeback
      
        release pc->mem_cgroup->move_lock
      
      The per-page statistics interface is heavily optimized to avoid a
      function call and a lookup_page_cgroup() in the file unmap fast path,
      which means it doesn't verify whether a page is still charged before
      clearing PageWriteback() and it has to do it in the stat update later.
      
      Rework it so that it looks up the page's memcg once at the beginning of
      the transaction and then uses it throughout.  The charge will be
      verified before clearing PageWriteback() and migration can't uncharge
      the page as long as that is still set.  The RCU lock will protect the
      memcg past uncharge.
      
      As far as losing the optimization goes, the following test results are
      from a microbenchmark that maps, faults, and unmaps a 4GB sparse file
      three times in a nested fashion, so that there are two negative passes
      that don't account but still go through the new transaction overhead.
      There is no actual difference:
      
       old:     33.195102545 seconds time elapsed       ( +-  0.01% )
       new:     33.199231369 seconds time elapsed       ( +-  0.03% )
      
      The time spent in page_remove_rmap()'s callees still adds up to the
      same, but the time spent in the function itself seems reduced:
      
           # Children      Self  Command        Shared Object       Symbol
       old:     0.12%     0.11%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
       new:     0.12%     0.08%  filemapstress  [kernel.kallsyms]   [k] page_remove_rmap
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[3.17.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7365e78
  3. 10 10月, 2014 6 次提交
    • V
      memcg: zap memcg_can_account_kmem · cf2b8fbf
      Vladimir Davydov 提交于
      memcg_can_account_kmem() returns true iff
      
          !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
                                         memcg_kmem_is_active(memcg);
      
      To begin with the !mem_cgroup_is_root(memcg) check is useless, because one
      can't enable kmem accounting for the root cgroup (mem_cgroup_write()
      returns EINVAL on an attempt to set the limit on the root cgroup).
      
      Furthermore, the !mem_cgroup_disabled() check also seems to be redundant.
      The point is memcg_can_account_kmem() is called from three places:
      mem_cgroup_salbinfo_read(), __memcg_kmem_get_cache(), and
      __memcg_kmem_newpage_charge().  The latter two functions are only invoked
      if memcg_kmem_enabled() returns true, which implies that the memory cgroup
      subsystem is enabled.  And mem_cgroup_slabinfo_read() shows the output of
      memory.kmem.slabinfo, which won't exist if the memory cgroup is completely
      disabled.
      
      So let's substitute all the calls to memcg_can_account_kmem() with plain
      memcg_kmem_is_active(), and kill the former.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf2b8fbf
    • J
      mm: memcontrol: fix transparent huge page allocations under pressure · b70a2a21
      Johannes Weiner 提交于
      In a memcg with even just moderate cache pressure, success rates for
      transparent huge page allocations drop to zero, wasting a lot of effort
      that the allocator puts into assembling these pages.
      
      The reason for this is that the memcg reclaim code was never designed for
      higher-order charges.  It reclaims in small batches until there is room
      for at least one page.  Huge page charges only succeed when these batches
      add up over a series of huge faults, which is unlikely under any
      significant load involving order-0 allocations in the group.
      
      Remove that loop on the memcg side in favor of passing the actual reclaim
      goal to direct reclaim, which is already set up and optimized to meet
      higher-order goals efficiently.
      
      This brings memcg's THP policy in line with the system policy: if the
      allocator painstakingly assembles a hugepage, memcg will at least make an
      honest effort to charge it.  As a result, transparent hugepage allocation
      rates amid cache activity are drastically improved:
      
                                            vanilla                 patched
      pgalloc                 4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
      pgfault                  491370.60 (  +0.00%)    225477.40 ( -54.11%)
      pgmajfault                    2.00 (  +0.00%)         1.80 (  -6.67%)
      thp_fault_alloc               0.00 (  +0.00%)       531.60 (+100.00%)
      thp_fault_fallback          749.00 (  +0.00%)       217.40 ( -70.88%)
      
      [ Note: this may in turn increase memory consumption from internal
        fragmentation, which is an inherent risk of transparent hugepages.
        Some setups may have to adjust the memcg limits accordingly to
        accomodate this - or, if the machine is already packed to capacity,
        disable the transparent huge page feature. ]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b70a2a21
    • J
      mm: memcontrol: simplify detecting when the memory+swap limit is hit · 3fbe7244
      Johannes Weiner 提交于
      When attempting to charge pages, we first charge the memory counter and
      then the memory+swap counter.  If one of the counters is at its limit, we
      enter reclaim, but if it's the memory+swap counter, reclaim shouldn't swap
      because that wouldn't change the situation.  However, if the counters have
      the same limits, we never get to the memory+swap limit.  To know whether
      reclaim should swap or not, there is a state flag that indicates whether
      the limits are equal and whether hitting the memory limit implies hitting
      the memory+swap limit.
      
      Just try the memory+swap counter first.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fbe7244
    • V
      memcg: move memcg_update_cache_size() to slab_common.c · 6f817f4c
      Vladimir Davydov 提交于
      `While growing per memcg caches arrays, we jump between memcontrol.c and
      slab_common.c in a weird way:
      
        memcg_alloc_cache_id - memcontrol.c
          memcg_update_all_caches - slab_common.c
            memcg_update_cache_size - memcontrol.c
      
      There's absolutely no reason why memcg_update_cache_size can't live on the
      slab's side though.  So let's move it there and settle it comfortably amid
      per-memcg cache allocation functions.
      
      Besides, this patch cleans this function up a bit, removing all the
      useless comments from it, and renames it to memcg_update_cache_params to
      conform to memcg_alloc/free_cache_params, which we already have in
      slab_common.c.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f817f4c
    • V
      memcg: don't call memcg_update_all_caches if new cache id fits · f3bb3043
      Vladimir Davydov 提交于
      memcg_update_all_caches grows arrays of per-memcg caches, so we only need
      to call it when memcg_limited_groups_array_size is increased.  However,
      currently we invoke it each time a new kmem-active memory cgroup is
      created.  Then it just iterates over all slab_caches and does nothing
      (memcg_update_cache_size returns immediately).
      
      This patch fixes this insanity.  In the meantime it moves the code dealing
      with id allocations to separate functions, memcg_alloc_cache_id and
      memcg_free_cache_id.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3bb3043
    • V
      memcg: move memcg_{alloc,free}_cache_params to slab_common.c · 33a690c4
      Vladimir Davydov 提交于
      The only reason why they live in memcontrol.c is that we get/put css
      reference to the owner memory cgroup in them.  However, we can do that in
      memcg_{un,}register_cache.  OTOH, there are several reasons to move them
      to slab_common.c.
      
      First, I think that the less public interface functions we have in
      memcontrol.h the better.  Since the functions I move don't depend on
      memcontrol, I think it's worth making them private to slab, especially
      taking into account that the arrays are defined on the slab's side too.
      
      Second, the way how per-memcg arrays are updated looks rather awkward: it
      proceeds from memcontrol.c (__memcg_activate_kmem) to slab_common.c
      (memcg_update_all_caches) and back to memcontrol.c again
      (memcg_update_array_size).  In the following patches I move the function
      relocating the arrays (memcg_update_array_size) to slab_common.c and
      therefore get rid this circular call path.  I think we should have the
      cache allocation stuff in the same place where we have relocation, because
      it's easier to follow the code then.  So I move arrays alloc/free
      functions to slab_common.c too.
      
      The third point isn't obvious.  I'm going to make the list_lru structure
      per-memcg to allow targeted kmem reclaim.  That means we will have
      per-memcg arrays in list_lrus too.  It turns out that it's much easier to
      update these arrays in list_lru.c rather than in memcontrol.c, because all
      the stuff we need is defined there.  This patch makes memcg caches arrays
      allocation path conform that of the upcoming list_lru.
      
      So let's move these functions to slab_common.c and make them static.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33a690c4
  4. 03 10月, 2014 1 次提交
    • J
      mm: memcontrol: do not iterate uninitialized memcgs · 2f7dd7a4
      Johannes Weiner 提交于
      The cgroup iterators yield css objects that have not yet gone through
      css_online(), but they are not complete memcgs at this point and so the
      memcg iterators should not return them.  Commit d8ad3055 ("mm/memcg:
      iteration skip memcgs not yet fully initialized") set out to implement
      exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
      not meet the ordering requirements for memcg, and so the iterator may
      skip over initialized groups, or return partially initialized memcgs.
      
      The cgroup core can not reasonably provide a clear answer on whether the
      object around the css has been fully initialized, as that depends on
      controller-specific locking and lifetime rules.  Thus, introduce a
      memcg-specific flag that is set after the memcg has been initialized in
      css_online(), and read before mem_cgroup_iter() callers access the memcg
      members.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f7dd7a4
  5. 05 9月, 2014 1 次提交
  6. 09 8月, 2014 4 次提交
    • J
      mm: memcontrol: avoid charge statistics churn during page migration · 6abb5a86
      Johannes Weiner 提交于
      Charge migration currently disables IRQs twice to update the charge
      statistics for the old page and then again for the new page.
      
      But migration is a seamless transition of a charge from one physical
      page to another one of the same size, so this should be a non-event from
      an accounting point of view.  Leave the statistics alone.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6abb5a86
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  7. 07 8月, 2014 11 次提交
  8. 31 7月, 2014 1 次提交
    • M
      memcg: oom_notify use-after-free fix · 2bcf2e92
      Michal Hocko 提交于
      Paul Furtado has reported the following GPF:
      
        general protection fault: 0000 [#1] SMP
        Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
        CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
        task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
        RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
        RSP: e02b:ffff8801d2ec7d48  EFLAGS: 00010283
        RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
        RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
        RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
        R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
        FS:  00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
        CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
        Call Trace:
          pagefault_out_of_memory+0x18/0x90
          mm_fault_error+0xa9/0x1a0
          __do_page_fault+0x478/0x4c0
          do_page_fault+0x2c/0x40
          page_fault+0x28/0x30
        Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
        RIP  mem_cgroup_oom_synchronize+0x140/0x240
      
      Commit fb2a6fc5 ("mm: memcg: rework and document OOM waiting and
      wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
      assuming it is protected by the hierarchical OOM-lock.
      
      Although this is true for the notification part the protection doesn't
      cover unregistration of event which can happen in parallel now so
      mem_cgroup_oom_notify can see already unlinked and/or freed
      mem_cgroup_eventfd_list.
      
      Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881
      
      Fixes: fb2a6fc5 (mm: memcg: rework and document OOM waiting and wakeup)
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NPaul Furtado <paulfurtado91@gmail.com>
      Tested-by: NPaul Furtado <paulfurtado91@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bcf2e92
  9. 15 7月, 2014 3 次提交
    • T
      cgroup: distinguish the default and legacy hierarchies when handling cftypes · a8ddc821
      Tejun Heo 提交于
      Until now, cftype arrays carried files for both the default and legacy
      hierarchies and the files which needed to be used on only one of them
      were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE.  This
      gets confusing very quickly and we may end up exposing interface files
      to the default hierarchy without thinking it through.
      
      This patch makes cgroup core provide separate sets of interfaces for
      cftype handling so that the cftypes for the default and legacy
      hierarchies are clearly distinguished.  The previous two patches
      renamed the existing ones so that they clearly indicate that they're
      for the legacy hierarchies.  This patch adds the interface for the
      default hierarchy and apply them selectively depending on the
      hierarchy type.
      
      * cftypes added through cgroup_subsys->dfl_cftypes and
        cgroup_add_dfl_cftypes() only show up on the default hierarchy.
      
      * cftypes added through cgroup_subsys->legacy_cftypes and
        cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.
      
      * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
        same array for the cases where the interface files are identical on
        both types of hierarchies.
      
      * This makes all the existing subsystem interface files legacy-only by
        default and all subsystems will have no interface file created when
        enabled on the default hierarchy.  Each subsystem should explicitly
        review and compose the interface for the default hierarchy.
      
      * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
        makes subsystems which haven't decided the interface files for the
        default hierarchy to present the legacy files on the default
        hierarchy so that its behavior on the default hierarchy can be
        tested.  As the awkward name suggests, this is for development only.
      
      * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
        array isn't used on the default hierarchy.  The flag is removed.
      
      v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.
      
      v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
          as suggested by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      a8ddc821
    • T
      cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() · 2cf669a5
      Tejun Heo 提交于
      Currently, cftypes added by cgroup_add_cftypes() are used for both the
      unified default hierarchy and legacy ones and subsystems can mark each
      file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
      appear only on one of them.  This is quite hairy and error-prone.
      Also, we may end up exposing interface files to the default hierarchy
      without thinking it through.
      
      cgroup_subsys will grow two separate cftype addition functions and
      apply each only on the hierarchies of the matching type.  This will
      allow organizing cftypes in a lot clearer way and encourage subsystems
      to scrutinize the interface which is being exposed in the new default
      hierarchy.
      
      In preparation, this patch adds cgroup_add_legacy_cftypes() which
      currently is a simple wrapper around cgroup_add_cftypes() and replaces
      all cgroup_add_cftypes() usages with it.
      
      While at it, this patch drops a completely spurious return from
      __hugetlb_cgroup_file_init().
      
      This patch doesn't introduce any functional differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      2cf669a5
    • T
      cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      5577964e
  10. 09 7月, 2014 2 次提交
    • T
      cgroup: remove sane_behavior support on non-default hierarchies · aa6ec29b
      Tejun Heo 提交于
      sane_behavior has been used as a development vehicle for the default
      unified hierarchy.  Now that the default hierarchy is in place, the
      flag became redundant and confusing as its usage is allowed on all
      hierarchies.  There are gonna be either the default hierarchy or
      legacy ones.  Let's make that clear by removing sane_behavior support
      on non-default hierarchies.
      
      This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
      comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
      cgroup_on_dfl() with sane_behavior specific part dropped.
      
      On the default and legacy hierarchies w/o sane_behavior, this
      shouldn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      aa6ec29b
    • T
      blkcg, memcg: make blkcg depend on memcg on the default hierarchy · 1ced953b
      Tejun Heo 提交于
      Currently, the blkio subsystem attributes all of writeback IOs to the
      root.  One of the issues is that there's no way to tell who originated
      a writeback IO from block layer.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      cgroup now has a mechanism to express such dependency -
      cgroup_subsys->depends_on.  This patch declares that blkcg depends on
      memcg so that memcg is enabled automatically on the default hierarchy
      when available.  Future changes will make blkcg map the memcg tag to
      find out the cgroup to blame for writeback IOs.
      
      As this means that a memcg may be made invisible, this patch also
      implements css_reset() for memcg which resets its basic
      configurations.  This implementation will probably need to be expanded
      to cover other states which are used in the default hierarchy.
      
      v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
          build failure.  Reported by kbuild test robot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      1ced953b