1. 03 10月, 2014 1 次提交
    • J
      mm: memcontrol: do not iterate uninitialized memcgs · 2f7dd7a4
      Johannes Weiner 提交于
      The cgroup iterators yield css objects that have not yet gone through
      css_online(), but they are not complete memcgs at this point and so the
      memcg iterators should not return them.  Commit d8ad3055 ("mm/memcg:
      iteration skip memcgs not yet fully initialized") set out to implement
      exactly this, but it uses CSS_ONLINE, a cgroup-internal flag that does
      not meet the ordering requirements for memcg, and so the iterator may
      skip over initialized groups, or return partially initialized memcgs.
      
      The cgroup core can not reasonably provide a clear answer on whether the
      object around the css has been fully initialized, as that depends on
      controller-specific locking and lifetime rules.  Thus, introduce a
      memcg-specific flag that is set after the memcg has been initialized in
      css_online(), and read before mem_cgroup_iter() callers access the memcg
      members.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f7dd7a4
  2. 05 9月, 2014 1 次提交
  3. 09 8月, 2014 4 次提交
    • J
      mm: memcontrol: avoid charge statistics churn during page migration · 6abb5a86
      Johannes Weiner 提交于
      Charge migration currently disables IRQs twice to update the charge
      statistics for the old page and then again for the new page.
      
      But migration is a seamless transition of a charge from one physical
      page to another one of the same size, so this should be a non-event from
      an accounting point of view.  Leave the statistics alone.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6abb5a86
    • J
      mm: memcontrol: use page lists for uncharge batching · 747db954
      Johannes Weiner 提交于
      Pages are now uncharged at release time, and all sources of batched
      uncharges operate on lists of pages.  Directly use those lists, and
      get rid of the per-task batching state.
      
      This also batches statistics accounting, in addition to the res
      counter charges, to reduce IRQ-disabling and re-enabling.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      747db954
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner 提交于
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: NJet Chen <jet.chen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Tested-by: NFelipe Balbi <balbi@ti.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner 提交于
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  4. 07 8月, 2014 11 次提交
  5. 31 7月, 2014 1 次提交
    • M
      memcg: oom_notify use-after-free fix · 2bcf2e92
      Michal Hocko 提交于
      Paul Furtado has reported the following GPF:
      
        general protection fault: 0000 [#1] SMP
        Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
        CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
        task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
        RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
        RSP: e02b:ffff8801d2ec7d48  EFLAGS: 00010283
        RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
        RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
        RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
        R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
        R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
        FS:  00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
        CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
        Call Trace:
          pagefault_out_of_memory+0x18/0x90
          mm_fault_error+0xa9/0x1a0
          __do_page_fault+0x478/0x4c0
          do_page_fault+0x2c/0x40
          page_fault+0x28/0x30
        Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
        RIP  mem_cgroup_oom_synchronize+0x140/0x240
      
      Commit fb2a6fc5 ("mm: memcg: rework and document OOM waiting and
      wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
      assuming it is protected by the hierarchical OOM-lock.
      
      Although this is true for the notification part the protection doesn't
      cover unregistration of event which can happen in parallel now so
      mem_cgroup_oom_notify can see already unlinked and/or freed
      mem_cgroup_eventfd_list.
      
      Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
      
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881
      
      Fixes: fb2a6fc5 (mm: memcg: rework and document OOM waiting and wakeup)
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Reported-by: NPaul Furtado <paulfurtado91@gmail.com>
      Tested-by: NPaul Furtado <paulfurtado91@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bcf2e92
  6. 15 7月, 2014 3 次提交
    • T
      cgroup: distinguish the default and legacy hierarchies when handling cftypes · a8ddc821
      Tejun Heo 提交于
      Until now, cftype arrays carried files for both the default and legacy
      hierarchies and the files which needed to be used on only one of them
      were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE.  This
      gets confusing very quickly and we may end up exposing interface files
      to the default hierarchy without thinking it through.
      
      This patch makes cgroup core provide separate sets of interfaces for
      cftype handling so that the cftypes for the default and legacy
      hierarchies are clearly distinguished.  The previous two patches
      renamed the existing ones so that they clearly indicate that they're
      for the legacy hierarchies.  This patch adds the interface for the
      default hierarchy and apply them selectively depending on the
      hierarchy type.
      
      * cftypes added through cgroup_subsys->dfl_cftypes and
        cgroup_add_dfl_cftypes() only show up on the default hierarchy.
      
      * cftypes added through cgroup_subsys->legacy_cftypes and
        cgroup_add_legacy_cftypes() only show up on the legacy hierarchies.
      
      * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the
        same array for the cases where the interface files are identical on
        both types of hierarchies.
      
      * This makes all the existing subsystem interface files legacy-only by
        default and all subsystems will have no interface file created when
        enabled on the default hierarchy.  Each subsystem should explicitly
        review and compose the interface for the default hierarchy.
      
      * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which
        makes subsystems which haven't decided the interface files for the
        default hierarchy to present the legacy files on the default
        hierarchy so that its behavior on the default hierarchy can be
        tested.  As the awkward name suggests, this is for development only.
      
      * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole
        array isn't used on the default hierarchy.  The flag is removed.
      
      v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl.
      
      v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed
          as suggested by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      a8ddc821
    • T
      cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() · 2cf669a5
      Tejun Heo 提交于
      Currently, cftypes added by cgroup_add_cftypes() are used for both the
      unified default hierarchy and legacy ones and subsystems can mark each
      file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to
      appear only on one of them.  This is quite hairy and error-prone.
      Also, we may end up exposing interface files to the default hierarchy
      without thinking it through.
      
      cgroup_subsys will grow two separate cftype addition functions and
      apply each only on the hierarchies of the matching type.  This will
      allow organizing cftypes in a lot clearer way and encourage subsystems
      to scrutinize the interface which is being exposed in the new default
      hierarchy.
      
      In preparation, this patch adds cgroup_add_legacy_cftypes() which
      currently is a simple wrapper around cgroup_add_cftypes() and replaces
      all cgroup_add_cftypes() usages with it.
      
      While at it, this patch drops a completely spurious return from
      __hugetlb_cgroup_file_init().
      
      This patch doesn't introduce any functional differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      2cf669a5
    • T
      cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      5577964e
  7. 09 7月, 2014 2 次提交
    • T
      cgroup: remove sane_behavior support on non-default hierarchies · aa6ec29b
      Tejun Heo 提交于
      sane_behavior has been used as a development vehicle for the default
      unified hierarchy.  Now that the default hierarchy is in place, the
      flag became redundant and confusing as its usage is allowed on all
      hierarchies.  There are gonna be either the default hierarchy or
      legacy ones.  Let's make that clear by removing sane_behavior support
      on non-default hierarchies.
      
      This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
      comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
      cgroup_on_dfl() with sane_behavior specific part dropped.
      
      On the default and legacy hierarchies w/o sane_behavior, this
      shouldn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      aa6ec29b
    • T
      blkcg, memcg: make blkcg depend on memcg on the default hierarchy · 1ced953b
      Tejun Heo 提交于
      Currently, the blkio subsystem attributes all of writeback IOs to the
      root.  One of the issues is that there's no way to tell who originated
      a writeback IO from block layer.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      cgroup now has a mechanism to express such dependency -
      cgroup_subsys->depends_on.  This patch declares that blkcg depends on
      memcg so that memcg is enabled automatically on the default hierarchy
      when available.  Future changes will make blkcg map the memcg tag to
      find out the cgroup to blame for writeback IOs.
      
      As this means that a memcg may be made invisible, this patch also
      implements css_reset() for memcg which resets its basic
      configurations.  This implementation will probably need to be expanded
      to cover other states which are used in the default hierarchy.
      
      v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid
          build failure.  Reported by kbuild test robot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      1ced953b
  8. 07 6月, 2014 3 次提交
  9. 05 6月, 2014 14 次提交
    • H
      mm, memcg: periodically schedule when emptying page list · 2a7a0e0f
      Hugh Dickins 提交于
      mem_cgroup_force_empty_list() can iterate a large number of pages on an
      lru and mem_cgroup_move_parent() doesn't return an errno unless certain
      criteria, none of which indicate that the iteration may be taking too
      long, is met.
      
      We have encountered the following stack trace many times indicating
      "need_resched set for > 51000020 ns (51 ticks) without schedule", for
      example:
      
      	scheduler_tick()
      	<timer irq>
      	mem_cgroup_move_account+0x4d/0x1d5
      	mem_cgroup_move_parent+0x8d/0x109
      	mem_cgroup_reparent_charges+0x149/0x2ba
      	mem_cgroup_css_offline+0xeb/0x11b
      	cgroup_offline_fn+0x68/0x16b
      	process_one_work+0x129/0x350
      
      If this iteration is taking too long, we still need to do cond_resched()
      even when an individual page is not busy.
      
      [rientjes@google.com: changelog]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a7a0e0f
    • V
      memcg: cleanup kmem cache creation/destruction functions naming · 776ed0f0
      Vladimir Davydov 提交于
      Current names are rather inconsistent. Let's try to improve them.
      
      Brief change log:
      
      ** old name **                          ** new name **
      
      kmem_cache_create_memcg                 memcg_create_kmem_cache
      memcg_kmem_create_cache                 memcg_regsiter_cache
      memcg_kmem_destroy_cache                memcg_unregister_cache
      
      kmem_cache_destroy_memcg_children       memcg_cleanup_cache_params
      mem_cgroup_destroy_all_caches           memcg_unregister_all_caches
      
      create_work                             memcg_register_cache_work
      memcg_create_cache_work_func            memcg_register_cache_func
      memcg_create_cache_enqueue              memcg_schedule_register_cache
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      776ed0f0
    • V
      memcg: memcg_kmem_create_cache: make memcg_name_buf statically allocated · 93f39eea
      Vladimir Davydov 提交于
      It isn't worth complicating the code by allocating it on the first access,
      because it only takes 256 bytes.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      93f39eea
    • V
      memcg: get rid of memcg_create_cache_name · 073ee1c6
      Vladimir Davydov 提交于
      Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
      order to just create the name of a per memcg cache, let's allocate it in
      place.  We only need to pass the memcg name to kmem_cache_create_memcg for
      that - everything else can be done in slab_common.c.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      073ee1c6
    • Q
      memcg: correct comments for __mem_cgroup_begin_update_page_stat · b5ffc856
      Qiang Huang 提交于
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5ffc856
    • Q
      memcg: fold mem_cgroup_stolen · bdcbb659
      Qiang Huang 提交于
      It is only used in __mem_cgroup_begin_update_page_stat(), the name is
      confusing and 2 routines for one thing also confuse people, so fold this
      function seems more clear.
      
      [akpm@linux-foundation.org: fix typo, per Michal]
      Signed-off-by: NQiang Huang <h.huangqiang@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bdcbb659
    • F
      mm/memcontrol.c: remove NULL assignment on static · ada4ba59
      Fabian Frederick 提交于
      static values are automatically initialized to NULL
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ada4ba59
    • C
      mm: replace __get_cpu_var uses with this_cpu_ptr · 7c8e0181
      Christoph Lameter 提交于
      Replace places where __get_cpu_var() is used for an address calculation
      with this_cpu_ptr().
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c8e0181
    • V
      memcg, slab: simplify synchronization scheme · bd673145
      Vladimir Davydov 提交于
      At present, we have the following mutexes protecting data related to per
      memcg kmem caches:
      
       - slab_mutex.  This one is held during the whole kmem cache creation
         and destruction paths.  We also take it when updating per root cache
         memcg_caches arrays (see memcg_update_all_caches).  As a result, taking
         it guarantees there will be no changes to any kmem cache (including per
         memcg).  Why do we need something else then?  The point is it is
         private to slab implementation and has some internal dependencies with
         other mutexes (get_online_cpus).  So we just don't want to rely upon it
         and prefer to introduce additional mutexes instead.
      
       - activate_kmem_mutex.  Initially it was added to synchronize
         initializing kmem limit (memcg_activate_kmem).  However, since we can
         grow per root cache memcg_caches arrays only on kmem limit
         initialization (see memcg_update_all_caches), we also employ it to
         protect against memcg_caches arrays relocation (e.g.  see
         __kmem_cache_destroy_memcg_children).
      
       - We have a convention not to take slab_mutex in memcontrol.c, but we
         want to walk over per memcg memcg_slab_caches lists there (e.g.  for
         destroying all memcg caches on offline).  So we have per memcg
         slab_caches_mutex's protecting those lists.
      
      The mutexes are taken in the following order:
      
         activate_kmem_mutex -> slab_mutex -> memcg::slab_caches_mutex
      
      Such a syncrhonization scheme has a number of flaws, for instance:
      
       - We can't call kmem_cache_{destroy,shrink} while walking over a
         memcg::memcg_slab_caches list due to locking order.  As a result, in
         mem_cgroup_destroy_all_caches we schedule the
         memcg_cache_params::destroy work shrinking and destroying the cache.
      
       - We don't have a mutex to synchronize per memcg caches destruction
         between memcg offline (mem_cgroup_destroy_all_caches) and root cache
         destruction (__kmem_cache_destroy_memcg_children).  Currently we just
         don't bother about it.
      
      This patch simplifies it by substituting per memcg slab_caches_mutex's
      with the global memcg_slab_mutex.  It will be held whenever a new per
      memcg cache is created or destroyed, so it protects per root cache
      memcg_caches arrays and per memcg memcg_slab_caches lists.  The locking
      order is following:
      
         activate_kmem_mutex -> memcg_slab_mutex -> slab_mutex
      
      This allows us to call kmem_cache_{create,shrink,destroy} under the
      memcg_slab_mutex.  As a result, we don't need memcg_cache_params::destroy
      work any more - we can simply destroy caches while iterating over a per
      memcg slab caches list.
      
      Also using the global mutex simplifies synchronization between concurrent
      per memcg caches creation/destruction, e.g.  mem_cgroup_destroy_all_caches
      vs __kmem_cache_destroy_memcg_children.
      
      The downside of this is that we substitute per-memcg slab_caches_mutex's
      with a hummer-like global mutex, but since we already take either the
      slab_mutex or the cgroup_mutex along with a memcg::slab_caches_mutex, it
      shouldn't hurt concurrency a lot.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd673145
    • V
      memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Vladimir Davydov 提交于
      Currently we have two pairs of kmemcg-related functions that are called on
      slab alloc/free.  The first is memcg_{bind,release}_pages that count the
      total number of pages allocated on a kmem cache.  The second is
      memcg_{un}charge_slab that {un}charge slab pages to kmemcg resource
      counter.  Let's just merge them to keep the code clean.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c67a8a68
    • V
      memcg, slab: do not schedule cache destruction when last page goes away · 1e32e77f
      Vladimir Davydov 提交于
      This patchset is a part of preparations for kmemcg re-parenting.  It
      targets at simplifying kmemcg work-flows and synchronization.
      
      First, it removes async per memcg cache destruction (see patches 1, 2).
      Now caches are only destroyed on memcg offline.  That means the caches
      that are not empty on memcg offline will be leaked.  However, they are
      already leaked, because memcg_cache_params::nr_pages normally never drops
      to 0 so the destruction work is never scheduled except kmem_cache_shrink
      is called explicitly.  In the future I'm planning reaping such dead caches
      on vmpressure or periodically.
      
      Second, it substitutes per memcg slab_caches_mutex's with the global
      memcg_slab_mutex, which should be taken during the whole per memcg cache
      creation/destruction path before the slab_mutex (see patch 3).  This
      greatly simplifies synchronization among various per memcg cache
      creation/destruction paths.
      
      I'm still not quite sure about the end picture, in particular I don't know
      whether we should reap dead memcgs' kmem caches periodically or try to
      merge them with their parents (see https://lkml.org/lkml/2014/4/20/38 for
      more details), but whichever way we choose, this set looks like a
      reasonable change to me, because it greatly simplifies kmemcg work-flows
      and eases further development.
      
      This patch (of 3):
      
      After a memcg is offlined, we mark its kmem caches that cannot be deleted
      right now due to pending objects as dead by setting the
      memcg_cache_params::dead flag, so that memcg_release_pages will schedule
      cache destruction (memcg_cache_params::destroy) as soon as the last slab
      of the cache is freed (memcg_cache_params::nr_pages drops to zero).
      
      I guess the idea was to destroy the caches as soon as possible, i.e.
      immediately after freeing the last object.  However, it just doesn't work
      that way, because kmem caches always preserve some pages for the sake of
      performance, so that nr_pages never gets to zero unless the cache is
      shrunk explicitly using kmem_cache_shrink.  Of course, we could account
      the total number of objects on the cache or check if all the slabs
      allocated for the cache are empty on kmem_cache_free and schedule
      destruction if so, but that would be too costly.
      
      Thus we have a piece of code that works only when we explicitly call
      kmem_cache_shrink, but complicates the whole picture a lot.  Moreover,
      it's racy in fact.  For instance, kmem_cache_shrink may free the last slab
      and thus schedule cache destruction before it finishes checking that the
      cache is empty, which can lead to use-after-free.
      
      So I propose to remove this async cache destruction from
      memcg_release_pages, and check if the cache is empty explicitly after
      calling kmem_cache_shrink instead.  This will simplify things a lot w/o
      introducing any functional changes.
      
      And regarding dead memcg caches (i.e.  those that are left hanging around
      after memcg offline for they have objects), I suppose we should reap them
      either periodically or on vmpressure as Glauber suggested initially.  I'm
      going to implement this later.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e32e77f
    • M
      memcg: do not hang on OOM when killed by userspace OOM access to memory reserves · d8dc595c
      Michal Hocko 提交于
      Eric has reported that he can see task(s) stuck in memcg OOM handler
      regularly.  The only way out is to
      
      	echo 0 > $GROUP/memory.oom_control
      
      His usecase is:
      
      - Setup a hierarchy with memory and the freezer (disable kernel oom and
        have a process watch for oom).
      
      - In that memory cgroup add a process with one thread per cpu.
      
      - In one thread slowly allocate once per second I think it is 16M of ram
        and mlock and dirty it (just to force the pages into ram and stay
        there).
      
      - When oom is achieved loop:
        * attempt to freeze all of the tasks.
        * if frozen send every task SIGKILL, unfreeze, remove the directory in
          cgroupfs.
      
      Eric has then pinpointed the issue to be memcg specific.
      
      All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
      Those that have received fatal signal will bypass the charge and should
      continue on their way out.  The tricky part is that the exit path might
      trigger a page fault (e.g.  exit_robust_list), thus the memcg charge,
      while its memcg is still under OOM because nobody has released any charges
      yet.
      
      Unlike with the in-kernel OOM handler the exiting task doesn't get
      TIF_MEMDIE set so it doesn't shortcut further charges of the killed task
      and falls to the memcg OOM again without any way out of it as there are no
      fatal signals pending anymore.
      
      This patch fixes the issue by checking PF_EXITING early in
      mem_cgroup_try_charge and bypass the charge same as if it had fatal
      signal pending or TIF_MEMDIE set.
      
      Normally exiting tasks (aka not killed) will bypass the charge now but
      this should be OK as the task is leaving and will release memory and
      increasing the memory pressure just to release it in a moment seems
      dubious wasting of cycles.  Besides that charges after exit_signals should
      be rare.
      
      I am bringing this patch again (rebased on the current mmotm tree). I
      hope we can move forward finally. If there is still an opposition then
      I would really appreciate a concurrent approach so that we can discuss
      alternatives.
      
      http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
      to the followup discussion when the patch has been dropped from the mmotm
      last time.
      Reported-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8dc595c
    • V
      memcg: un-export __memcg_kmem_get_cache · e8d9df3a
      Vladimir Davydov 提交于
      It is only used in slab and should not be used anywhere else so there is
      no need in exporting it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8d9df3a
    • J
      mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control · 3dae7fec
      Johannes Weiner 提交于
      Per-memcg swappiness and oom killing can currently not be tweaked on a
      memcg that is part of a hierarchy, but not the root of that hierarchy.
      Users have complained that they can't configure this when they turned on
      hierarchy mode.  In fact, with hierarchy mode becoming the default, this
      restriction disables the tunables entirely.
      
      But there is no good reason for this restriction.  The settings for
      swappiness and OOM killing are taken from whatever memcg whose limit
      triggered reclaim and OOM invocation, regardless of its position in the
      hierarchy tree.
      
      Allow setting swappiness on any group.  The knob on the root memcg
      already reads the global VM swappiness, make it writable as well.
      
      Allow disabling the OOM killer on any non-root memcg.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3dae7fec