1. 15 Jan 2016 (8 commits)
  2. 30 Dec 2015 (1 commit)
    • mm: memcontrol: fix possible memcg leak due to interrupted reclaim · 6df38689
      Committed by Vladimir Davydov
      Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
      once enough pages have been reclaimed, in which case, in contrast to a
      full round-trip over a cgroup sub-tree, the current position stored in
      mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
      and so is left holding the reference to the last scanned cgroup.  If the
      target cgroup does not get scanned again (we might have just reclaimed
      the last page or all processes might exit and free their memory
      voluntarily), we will leak it, because there is nobody to put the
      reference held by the iterator.
      
      The problem is easy to reproduce by running the following command
      sequence in a loop:
      
          mkdir /sys/fs/cgroup/memory/test
          echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
          memhog 150M
          echo $$ > /sys/fs/cgroup/memory/cgroup.procs
          rmdir test
      
      The cgroups generated by it will never get freed.
      
      This patch fixes this issue by making mem_cgroup_iter avoid taking
      reference to the current position.  In order not to hit use-after-free
      bug while running reclaim in parallel with cgroup deletion, we make use
      of ->css_released cgroup callback to clear references to the dying
      cgroup in all reclaim iterators that might refer to it.  This callback
      is called right before scheduling the RCU work which will free the css,
      so if we access iter->position from an RCU read-side section, we can be
      sure it won't go away under us.
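      
      A minimal sketch of the invalidation hook described above, assuming the
      callback and helper names used by the final patch
      (mem_cgroup_css_released() and invalidate_reclaim_iterators()); the
      iterator walk itself is elided:
      
          static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
          {
                  struct mem_cgroup *memcg = mem_cgroup_from_css(css);
      
                  /*
                   * Walk every reclaim iterator that might still point at the
                   * dying memcg and clear it, e.g. with
                   * cmpxchg(&iter->position, memcg, NULL), so the now
                   * reference-free iter->position can never dangle.
                   */
                  invalidate_reclaim_iterators(memcg);
          }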
      
      [hannes@cmpxchg.org: clean up css ref handling]
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org> [3.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 28 Dec 2015 (1 commit)
    • cgroup: Fix uninitialized variable warning · eed67d75
      Committed by Ross Zwisler
      Commit 1f7dd3e5 ("cgroup: fix handling of multi-destination migration
      from subtree_control enabling") introduced the following compiler warning:
      
      mm/memcontrol.c: In function ‘mem_cgroup_can_attach’:
      mm/memcontrol.c:4790:9: warning: ‘memcg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
         mc.to = memcg;
               ^
      
      Fix this by initializing 'memcg' to NULL.
      
      This was found using gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6).
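      
      The fix itself is the one-line declaration in mem_cgroup_can_attach()
      described above:
      
          struct mem_cgroup *memcg = NULL;   /* silence -Wmaybe-uninitialized */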
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 13 Dec 2015 (2 commits)
  5. 03 Dec 2015 (1 commit)
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Committed by Tejun Heo
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes to the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
      by updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All controllers
      are updated accordingly (see the sketch after the list below).
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
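      
      A hedged sketch of the resulting controller method shape; foo_can_attach()
      is an illustrative name, while the tset-only prototype and the
      three-argument cgroup_taskset_for_each() follow the description above:
      
          static int foo_can_attach(struct cgroup_taskset *tset)
          {
                  struct task_struct *task;
                  struct cgroup_subsys_state *dst_css;
      
                  /* each migrating task is paired with its own destination css */
                  cgroup_taskset_for_each(task, dst_css, tset) {
                          /* account against dst_css, not a single parent css */
                  }
                  return 0;
          }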
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
  6. 07 Nov 2015 (3 commits)
    • mm/memcontrol.c: uninline mem_cgroup_usage · 6f646156
      Committed by Andrew Morton
      gcc version 5.2.1 20151010 (Debian 5.2.1-22)
      $ size mm/memcontrol.o mm/memcontrol.o.before
         text    data     bss     dec     hex filename
        35535    7908      64   43507    a9f3 mm/memcontrol.o
        35762    7908      64   43734    aad6 mm/memcontrol.o.before
      
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM · 71baba4b
      Committed by Mel Gorman
      __GFP_WAIT was used to signal that the caller was in atomic context and
      could not sleep.  Now it is possible to distinguish between true atomic
      context and callers that are not willing to sleep.  The latter should
      clear __GFP_DIRECT_RECLAIM so kswapd will still wake.  As clearing
      __GFP_WAIT behaves differently, there is a risk that people will clear the
      wrong flags.  This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
      indicate what it does -- setting it allows all reclaim activity, clearing
      it prevents it.
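      
      A sketch of what the renamed flag boils down to, assuming the gfp.h
      definitions of that era:
      
          /* both reclaim paths allowed; clear it to forbid all reclaim */
          #define __GFP_RECLAIM \
                  ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))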
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Committed by Mel Gorman
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min", which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of call sites:
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
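      
      A hedged sketch of the new helper and the flag roles described above
      (simplified; the actual definitions live in include/linux/gfp.h):
      
          /* true if the caller may enter direct reclaim, i.e. may block */
          static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
          {
                  return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
          }
      
          /*
           * __GFP_ATOMIC         - truly atomic, may dip into atomic reserves
           * __GFP_DIRECT_RECLAIM - may sleep and enter direct reclaim
           * __GFP_KSWAPD_RECLAIM - wake kswapd for background reclaim
           */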
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 06 Nov 2015 (11 commits)
    • memcg: fix thresholds for 32b architectures. · c12176d3
      Committed by Michal Hocko
      Commit 424cdc14 ("memcg: convert threshold to bytes") has fixed a
      regression introduced by 3e32cb2e ("mm: memcontrol: lockless page
      counters") where thresholds were silently converted to use page units
      rather than bytes when interpreting the user input.
      
      The fix is not complete, though, as properly pointed out by Ben Hutchings
      during stable backport review.  The page count is converted to bytes but
      unsigned long is used to hold the value which would be obviously not
      sufficient for 32b systems with more than 4G thresholds.  The same applies
      to usage as taken from mem_cgroup_usage which might overflow.
      
      Let's remove this bytes vs. pages internal tracking difference and
      handle thresholds in page units internally.  Change mem_cgroup_usage() to
      return the value in page units and revert 424cdc14 because this should
      be sufficient for consistent handling.  mem_cgroup_read_u64, as the
      only user of mem_cgroup_usage outside of the threshold handling code, is
      converted to give the proper result in bytes.  It is doing that already
      for page_counter output so this is more consistent as well.
      
      The value presented to the userspace is still in bytes units.
      
      Fixes: 424cdc14 ("memcg: convert threshold to bytes")
      Fixes: 3e32cb2e ("mm: memcontrol: lockless page counters")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Ben Hutchings <ben@decadent.org.uk>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      From: Michal Hocko <mhocko@kernel.org>
      Subject: memcg-fix-thresholds-for-32b-architectures-fix
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      From: Andrew Morton <akpm@linux-foundation.org>
      Subject: memcg-fix-thresholds-for-32b-architectures-fix-fix
      
      don't attempt to inline mem_cgroup_usage()
      
      The compiler ignores the inline anyway.  And __always_inlining it adds 600
      bytes of goop to the .o file.
      
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_counter: let page_counter_try_charge() return bool · 6071ca52
      Committed by Johannes Weiner
      page_counter_try_charge() currently returns 0 on success and -ENOMEM on
      failure, which is surprising behavior given the function name.
      
      Make it follow the expected pattern of try_stuff() functions that return a
      boolean true to indicate success, or false for failure.
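      
      A hedged caller-side sketch of the new convention, assuming the
      three-argument signature with an output pointer for the counter that hit
      its limit:
      
          struct page_counter *fail;
      
          if (!page_counter_try_charge(&memcg->memory, nr_pages, &fail))
                  goto charge_failed;  /* 'fail' is the counter that hit its limit */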
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: eliminate root memory.current · f5fc3c5d
      Committed by Johannes Weiner
      memory.current on the root level doesn't add anything that wouldn't be
      more accurate and detailed using system statistics.  It already doesn't
      include slabs, and it'll be a pain to keep in sync when further memory
      types are accounted in the memory controller.  Remove it.
      
      Note that this applies to the new unified hierarchy interface only.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Committed by Hugh Dickins
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: simplify and inline __mem_cgroup_from_kmem · df406551
      Committed by Vladimir Davydov
      Before the previous patch ("memcg: unify slab and other kmem pages
      charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
      pages and pages allocated with alloc_kmem_pages - because only the latter
      stored the memcg in the page struct.  Now that charging is unified, the
      two cases can be handled identically, and since the function becomes
      tiny, we can fold it into mem_cgroup_from_kmem.
      
      [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: unify slab and other kmem pages charging · f3ccb2c4
      Committed by Vladimir Davydov
      We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
      uncharging kmem pages to memcg, but currently they are not used for
      charging slab pages (i.e.  they are only used for charging pages allocated
      with alloc_kmem_pages).  The only reason why the slab subsystem uses
      special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
      needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
      to the memcg that the current task belongs to.
      
      To remove this diversity, this patch adds an extra argument to
      __memcg_kmem_charge that can be a pointer to a memcg or NULL.  If it is
      not NULL, the function tries to charge to the memcg it points to,
      otherwise it charges to the current context.  Next, it makes the slab
      subsystem use this function to charge slab pages.
      
      Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
      in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined.  Since
      __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
      don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
      Besides, one can now detect which memcg a slab page belongs to by reading
      /proc/kpagecgroup.
      
      Note, this patch switches slab to charge-after-alloc design.  Since this
      design is already used for all other memcg charges, it should not make any
      difference.
      
      [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: simplify charging kmem pages · d05e83a6
      Committed by Vladimir Davydov
      Charging kmem pages proceeds in two steps.  First, we try to charge the
      allocation size to the memcg the current task belongs to, then we allocate
      a page and "commit" the charge storing the pointer to the memcg in the
      page struct.
      
      Such a design looks overcomplicated, because there is not much sense in
      trying to charge the allocation before actually allocating a page: we won't
      be able to consume much memory over the limit even if we charge after
      doing the actual allocation; besides, we already charge user pages post
      factum, so being pedantic with kmem pages just looks pointless.
      
      So this patch simplifies the design by merging the "charge" and the
      "commit" steps into the same function, which takes the allocated page.
      
      Also, rename the charge and uncharge methods to memcg_kmem_charge and
      memcg_kmem_uncharge and make the charge method return error code instead
      of bool to conform to mem_cgroup_try_charge.
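      
      A hedged sketch of how a kmem page allocator would use the merged
      charge step after this change (error-code return, allocated page passed
      in):
      
          page = alloc_pages(gfp_mask, order);
          if (page && memcg_kmem_charge(page, gfp_mask, order)) {
                  /* charge failed: undo the allocation instead of committing it */
                  __free_pages(page, order);
                  page = NULL;
          }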
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: fix order calculation in try_charge() · 3608de07
      Committed by Jerome Marchand
      Since commit 6539cc05 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
      the order to pass to mem_cgroup_oom() is calculated by passing the
      number of pages to get_order() instead of the expected size in bytes.
      AFAICT, it only affects the value displayed in the oom warning message.
      This patch fixes this.
      
      Michal said:
      
      : We haven't noticed that just because the OOM is enabled only for page
      : faults of order-0 (single page) and get_order work just fine.  Thanks for
      : noticing this.  If we ever start triggering OOM on different orders this
      : would be broken.
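      
      A sketch of the corrected call, assuming the conversion described above
      (a size in bytes, not a page count, fed to get_order()):
      
          mem_cgroup_oom(mem_over_limit, gfp_mask,
                         get_order(nr_pages * PAGE_SIZE));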
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: ratify and consolidate over-charge handling · 10d53c74
      Committed by Tejun Heo
      try_charge() is the main charging logic of memcg.  When it hits the limit
      but either can't fail the allocation due to __GFP_NOFAIL or the task is
      likely to free memory very soon (it is being OOM killed, has SIGKILL
      pending or is exiting), it "bypasses" the charge to the root memcg and
      returns -EINTR.
      While this is one approach which can be taken for these situations, it has
      several issues.
      
      * It unnecessarily lies about the reality.  The number itself doesn't
        go over the limit but the actual usage does.  memcg is either forced
        to or actively chooses to go over the limit because that is the
        right behavior under the circumstances, which is completely fine,
        but, if at all avoidable, it shouldn't be misrepresenting what's
        happening by sneaking the charges into the root memcg.
      
      * Despite trying, we already do over-charge.  kmemcg can't deal with
        switching over to the root memcg by the point try_charge() returns
        -EINTR, so it open-codes over-charging.
      
      * It complicates the callers.  Each try_charge() user has to handle
        the weird -EINTR exception.  memcg_charge_kmem() does the manual
        over-charging.  mem_cgroup_do_precharge() performs unnecessary
        uncharging of root memcg, which BTW is inconsistent with what
        memcg_charge_kmem() does but not broken as [un]charging are noops on
        root memcg.  mem_cgroup_try_charge() needs to switch the returned
        cgroup to the root one.
      
      The reality is that in memcg there are cases where we are forced and/or
      willing to go over the limit.  Each such case needs to be scrutinized and
      justified but there definitely are situations where that is the right
      thing to do.  We already do this but with a superficial and inconsistent
      disguise which leads to unnecessary complications.
      
      This patch updates try_charge() so that it over-charges and returns 0 when
      deemed necessary.  -EINTR return is removed along with all special case
      handling in the callers.
      
      While at it, remove the local variable @ret, which was initialized to zero
      and never changed, along with done: label which just returned the always
      zero @ret.
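      
      A hedged sketch of the forced-charge path that replaces the -EINTR
      bypass; the helper and variable names (page_counter_charge,
      do_swap_account, css_get_many) are assumptions based on the
      memcontrol.c of that era:
      
          force:
                  /*
                   * The allocation either can't fail or the task will free
                   * memory shortly; charge past the limit deliberately instead
                   * of hiding the usage in the root memcg.
                   */
                  page_counter_charge(&memcg->memory, nr_pages);
                  if (do_swap_account)
                          page_counter_charge(&memcg->memsw, nr_pages);
                  css_get_many(&memcg->css, nr_pages);
                  return 0;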
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: punt high overage reclaim to return-to-userland path · b23afb93
      Committed by Tejun Heo
      Currently, try_charge() tries to reclaim memory synchronously when the
      high limit is breached; however, if the allocation doesn't have
      __GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
      speculative allocations, it can blow way past the high limit.  This is
      actually easily reproducible by simply doing "find /".  slab/slub
      allocator tries speculative allocations first, so as long as there's
      memory which can be consumed without blocking, it can keep allocating
      memory regardless of the high limit.
      
      This patch makes try_charge() always punt the over-high reclaim to the
      return-to-userland path.  If try_charge() detects that the high limit is
      breached, it adds the overage to current->memcg_nr_pages_over_high and
      schedules execution of mem_cgroup_handle_over_high() which performs
      synchronous reclaim from the return-to-userland path.
      
      As long as the kernel doesn't have a run-away allocation spree, this should
      provide enough protection while making kmemcg behave more consistently.
      It also has the following benefits.
      
      - All over-high reclaims can use GFP_KERNEL regardless of the specific
        gfp mask in use, e.g. GFP_NOFS, when the limit was breached.
      
      - It copes with prio inversion.  Previously, a low-prio task with
        small memory.high might perform over-high reclaim with a bunch of
        locks held.  If a higher prio task needed any of these locks, it
        would have to wait until the low prio task finished reclaim and
        released the locks.  By handing over-high reclaim to the task exit
        path this issue can be avoided.
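      
      A hedged sketch of the punt described above; the exact notification
      mechanism (set_notify_resume() feeding mem_cgroup_handle_over_high() on
      the way back to userland) is an assumption based on the commit text:
      
          if (page_counter_read(&memcg->memory) > memcg->high) {
                  /* record the overage and defer reclaim to return-to-userland */
                  current->memcg_nr_pages_over_high += nr_pages;
                  set_notify_resume(current);
          }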
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: flatten task_struct->memcg_oom · 626ebc41
      Committed by Tejun Heo
      task_struct->memcg_oom is a sub-struct containing fields which are used
      for async memcg oom handling.  Most task_struct fields aren't packaged
      this way and it can lead to unnecessary alignment paddings.  This patch
      flattens it.
      
      * task.memcg_oom.memcg    -> task.memcg_in_oom
      * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask
      * task.memcg_oom.order    -> task.memcg_oom_order
      * task.memcg_oom.may_oom  -> task.memcg_may_oom
      
      In addition, task.memcg_may_oom is relocated to where other bitfields are
      which reduces the size of task_struct.
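      
      A sketch of the flattened fields; the names come from the mapping listed
      above, while the types and exact placement are assumptions:
      
          #ifdef CONFIG_MEMCG
                  struct mem_cgroup       *memcg_in_oom;
                  gfp_t                   memcg_oom_gfp_mask;
                  int                     memcg_oom_order;
          #endif
                  /* memcg_may_oom now lives with the other bitfields */
                  unsigned                memcg_may_oom:1;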
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 17 Oct 2015 (1 commit)
  9. 16 Oct 2015 (1 commit)
    • cgroup: replace cgroup_has_tasks() with cgroup_is_populated() · 27bd4dbb
      Committed by Tejun Heo
      Currently, cgroup_has_tasks() tests whether the target cgroup has any
      css_set linked to it.  This works because a css_set's refcnt converges
      with the number of tasks linked to it and thus there's no css_set
      linked to a cgroup if it doesn't have any live tasks.
      
      To help tracking resource usage of zombie tasks, putting the ref of
      css_set will be separated from disassociating the task from the
      css_set which means that a cgroup may have css_sets linked to it even
      when it doesn't have any live tasks.
      
      This patch replaces cgroup_has_tasks() with cgroup_is_populated()
      which tests cgroup->nr_populated instead which locally counts the
      number of populated css_sets.  Unlike cgroup_has_tasks(),
      cgroup_is_populated() is recursive - if any of the descendants is
      populated, the cgroup is populated too.  While this changes the
      meaning of the test, all the existing users are okay with the change.
      
      While at it, replace the open-coded ->populated_cnt test in
      cgroup_events_show() with cgroup_is_populated().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
  10. 13 Oct 2015 (1 commit)
    • writeback: fix incorrect calculation of available memory for memcg domains · c5edf9cd
      Committed by Tejun Heo
      For memcg domains, the amount of available memory was calculated as
      
       min(the amount currently in use + headroom according to memcg,
           total clean memory)
      
      This isn't quite correct as what should be capped by the amount of
      clean memory is the headroom, not the sum of memory in use and
      headroom.  For example, if a memcg domain has a significant amount of
      dirty memory, the above can lead to a value which is lower than the
      current amount in use which doesn't make much sense.  In most
      circumstances, the above leads to a number which is somewhat but not
      drastically lower.
      
      As the amount of memory which can be readily allocated to the memcg
      domain is capped by the amount of system-wide clean memory which is
      not already assigned to the memcg itself, the number we want is
      
       the amount currently in use +
       min(headroom according to memcg, clean memory elsewhere in the system)
      
      This patch updates mem_cgroup_wb_stats() to return the number of
      filepages and headroom instead of the calculated available pages.
      mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the
      above calculation from file, headroom, dirty and globally clean pages.
      
      v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading
          to build failure when !CGROUP_WRITEBACK.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling")
      Signed-off-by: Jens Axboe <axboe@fb.com>
  11. 02 Oct 2015 (2 commits)
  12. 23 Sep 2015 (1 commit)
    • cgroup, memcg, cpuset: implement cgroup_taskset_for_each_leader() · 4530eddb
      Committed by Tejun Heo
      It wasn't explicitly documented but, when a process is being migrated,
      cpuset and memcg depend on cgroup_taskset_first() returning the
      threadgroup leader; however, this approach is somewhat ghetto and
      would no longer work for the planned multi-process migration.
      
      This patch introduces explicit cgroup_taskset_for_each_leader() which
      iterates over only the threadgroup leaders and replaces
      cgroup_taskset_first() usages for accessing the leader with it.
      
      This prepares both memcg and cpuset for multi-process migration.  This
      patch also updates the documentation for cgroup_taskset_for_each() to
      clarify the iteration rules and removes comments mentioning task
      ordering in tasksets.
      
      v2: A previous patch which added threadgroup leader test was dropped.
          Patch updated accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  13. 22 Sep 2015 (1 commit)
  14. 19 Sep 2015 (1 commit)
    • cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE · 7dbdb199
      Committed by Tejun Heo
      cftype->mode allows controllers to give arbitrary permissions to
      interface knobs.  Except for "cgroup.event_control", the existing uses
      are spurious.
      
      * Some explicitly specify S_IRUGO | S_IWUSR even though that's the
        default.
      
      * "cpuset.memory_pressure" specifies S_IRUGO while also setting a
        write callback which returns -EACCES.  All it needs to do is simply
        not set a write callback.
      
      "cgroup.event_control" uses cftype->mode to make the file
      world-writable.  It's a misdesigned interface and we don't want
      controllers to be tweaking interface file permissions in general.
      This patch removes cftype->mode and all its spurious uses and
      implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
      marked as compatibility-only.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
  15. 18 Sep 2015 (1 commit)
    • cgroup: replace cgroup_on_dfl() tests in controllers with cgroup_subsys_on_dfl() · 9e10a130
      Committed by Tejun Heo
      cgroup_on_dfl() tests whether the cgroup's root is the default
      hierarchy; however, an individual controller is only interested in
      whether the controller is attached to the default hierarchy and never
      tests a cgroup which doesn't belong to the hierarchy that the
      controller is attached to.
      
      This patch replaces cgroup_on_dfl() tests in controllers with faster
      static_key based cgroup_subsys_on_dfl().  This leaves cgroup core as
      the only user of cgroup_on_dfl() and the function is moved from the
      header file to cgroup.c.
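      
      A usage sketch of the replacement test from a controller's point of view,
      with memory_cgrp_subsys as the example subsystem:
      
          /* static-key based: resolves to a jump label, not a pointer chase */
          if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
                  /* controller is attached to the default (v2) hierarchy */
          }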
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
  16. 11 Sep 2015 (2 commits)
    • memcg: zap try_get_mem_cgroup_from_page · e993d905
      Committed by Vladimir Davydov
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: add page_cgroup_ino helper · 2fc04524
      Committed by Vladimir Davydov
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change in time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help take a more
      optimal decision when considering migrating the unit to another node
      within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/).  With idle tracking, a smart
      user-space memory manager could reclaim only the idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the amount of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
      
      For an example of using these files for estimating the amount of unused
      memory pages per memory cgroup, please see the script attached
      below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 09 Sep 2015 (2 commits)