1. 27 Jul, 2016 7 commits
  2. 23 Jul, 2016 1 commit
    • J
      mm: memcontrol: fix cgroup creation failure after many small jobs · 73f576c0
      Committed by Johannes Weiner
      The memory controller has quite a bit of state that usually outlives the
      cgroup and pins its CSS until said state disappears.  At the same time
      it imposes a 16-bit limit on the CSS ID space to economically store IDs
      in the wild.  Consequently, when we use cgroups to contain frequent but
      small and short-lived jobs that leave behind some page cache, we quickly
      run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
      fails with -ENOSPC while there are only a few, or even no user-visible
      cgroups in existence.
      
      Although pinning CSSs past cgroup removal is common, there are only two
      instances that actually need an ID after a cgroup is deleted: cache
      shadow entries and swapout records.
      
      Cache shadow entries reference the ID weakly and can deal with the CSS
      having disappeared when it's looked up later.  They pose no hurdle.
      
      Swap-out records do need to pin the css to hierarchically attribute
      swapins after the cgroup has been deleted; though the only pages that
      remain swapped out after offlining are tmpfs/shmem pages.  And those
      references are under the user's control, so they are manageable.
      
      This patch introduces a private 16-bit memcg ID and switches swap and
      cache shadow entries over to using that.  This ID can then be recycled
      after offlining when the CSS remains pinned only by objects that don't
      specifically need it.
      
      This script demonstrates the problem by faulting one cache page in a new
      cgroup and deleting it again:
      
        set -e
        mkdir -p pages
        for x in `seq 128000`; do
          [ $((x % 1000)) -eq 0 ] && echo $x
          mkdir /cgroup/foo
          echo $$ >/cgroup/foo/cgroup.procs
          echo trex >pages/$x
          echo $$ >/cgroup/cgroup.procs
          rmdir /cgroup/foo
        done
      
      When run on an unpatched kernel, we eventually run out of possible IDs
      even though there are no visible cgroups:
      
        [root@ham ~]# ./cssidstress.sh
        [...]
        65000
        mkdir: cannot create directory '/cgroup/foo': No space left on device
      
      After this patch, the IDs get released upon cgroup destruction and the
      cache and css objects get released once memory reclaim kicks in.
      
      [hannes@cmpxchg.org: init the IDR]
        Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
      Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups")
      Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: John Garcia <john.garcia@mesosphere.io>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: <stable@vger.kernel.org>	[3.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
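The effect of recycling IDs at offline time, as described in 73f576c0, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it uses a plain bitmap instead of the kernel's IDR, and `struct id_pool`, `id_alloc()`, `id_free()`, and `simulate()` are hypothetical names invented for the illustration. The exact number of usable IDs in the kernel may differ slightly from this model.

```c
#include <stdint.h>
#include <string.h>

#define ID_MAX 65535u  /* largest 16-bit ID; 0 is reserved as "no ID" */

/* Hypothetical pool: one bit per ID, set while the ID is in use. */
struct id_pool {
    uint8_t used[(ID_MAX + 1) / 8];
    uint32_t next_hint;  /* ID at which the next search starts */
};

void id_pool_init(struct id_pool *p)
{
    memset(p, 0, sizeof(*p));
    p->next_hint = 1;
}

/* Returns an ID in 1..ID_MAX, or 0 if the ID space is exhausted. */
uint16_t id_alloc(struct id_pool *p)
{
    for (uint32_t n = 0; n < ID_MAX; n++) {
        uint32_t id = 1 + (p->next_hint - 1 + n) % ID_MAX;

        if (!(p->used[id / 8] & (1u << (id % 8)))) {
            p->used[id / 8] |= (uint8_t)(1u << (id % 8));
            p->next_hint = id % ID_MAX + 1;
            return (uint16_t)id;
        }
    }
    return 0;
}

void id_free(struct id_pool *p, uint16_t id)
{
    p->used[id / 8] &= (uint8_t)~(1u << (id % 8));
}

/*
 * Simulate `cycles` rounds of creating and deleting a group.
 * With recycle_at_offline == 0 the ID stays pinned after deletion,
 * which models the behaviour before the patch. Returns the number
 * of creations that succeeded.
 */
unsigned simulate(unsigned cycles, int recycle_at_offline)
{
    static struct id_pool pool;
    unsigned ok = 0;

    id_pool_init(&pool);
    for (unsigned i = 0; i < cycles; i++) {
        uint16_t id = id_alloc(&pool);

        if (!id)
            break;  /* corresponds to mkdir failing with -ENOSPC */
        ok++;
        if (recycle_at_offline)
            id_free(&pool, id);
    }
    return ok;
}
```

In this model, `simulate(128000, 0)` stops once the 16-bit space is used up, while `simulate(128000, 1)` completes every round because each ID is returned as soon as its group goes away.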
  3. 25 Jun, 2016 2 commits
    • T
      memcg: css_alloc should return an ERR_PTR value on error · ea3a9645
      Committed by Tejun Heo
      mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
      expected it to return an ERR_PTR value, leading to the following NULL
      deref after a css allocation failure.  Fix it by returning
      ERR_PTR(-ENOMEM) instead.  I'll also update cgroup core so that it
      can handle NULL returns.
      
        mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        ...
        Call Trace:
          dump_stack+0x68/0xa1
          warn_alloc_failed+0xd6/0x130
          __alloc_pages_nodemask+0x4c6/0xf20
          alloc_pages_current+0x66/0xe0
          alloc_kmem_pages+0x14/0x80
          kmalloc_order_trace+0x2a/0x1a0
          __kmalloc+0x291/0x310
          memcg_update_all_caches+0x6c/0x130
          mem_cgroup_css_alloc+0x590/0x610
          cgroup_apply_control_enable+0x18b/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        ...
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP:  init_and_link_css+0x37/0x220
        PGD 34b1e067 PUD 3a109067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
        task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
        RIP: 0010:[<ffffffff810f2ca7>]  [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
        RSP: 0018:ffff8800666d7d90  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
        RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
        R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
        FS:  00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          cgroup_apply_control_enable+0x1ac/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
        RIP   init_and_link_css+0x37/0x220
         RSP <ffff8800666d7d90>
        CR2: 00000000000000d0
        ---[ end trace a2d8836ae1e852d1 ]---
      
      Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
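The mismatch fixed in ea3a9645 comes from the kernel's error-pointer convention, in which a caller that only checks IS_ERR() treats NULL as a valid pointer. The following user-space sketch re-implements simplified stand-ins for ERR_PTR(), PTR_ERR(), and IS_ERR(); `css_alloc_buggy()` and `css_alloc_fixed()` are hypothetical functions made up for the illustration and are not the kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space stand-ins for the kernel helpers of the same names. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses; NULL is not among them. */
static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator, old style: NULL on failure. */
void *css_alloc_buggy(int fail)
{
    static int object;

    return fail ? NULL : &object;
}

/* Hypothetical allocator, fixed style: encoded errno on failure. */
void *css_alloc_fixed(int fail)
{
    static int object;

    return fail ? ERR_PTR(-ENOMEM) : &object;
}
```

A caller that tests only `IS_ERR(css)` accepts the NULL from `css_alloc_buggy(1)` and goes on to dereference it, whereas the value from `css_alloc_fixed(1)` is rejected and `PTR_ERR()` recovers `-ENOMEM`.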
    • T
      memcg: mem_cgroup_migrate() may be called with irq disabled · d93c4130
      Committed by Tejun Heo
      mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
      with irq disabled from migrate_page_copy().  This ends up enabling irq
      while holding an irq context lock, triggering the following lockdep
      warning.  Fix it by using irq_save/restore instead.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.7.0-rc1+ #52 Tainted: G        W
        ---------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
        {IN-SOFTIRQ-W} state was registered at:
           __lock_acquire+0x5b6/0x1930
           lock_acquire+0xee/0x270
           _raw_spin_lock_irqsave+0x66/0xb0
           aio_complete+0x98/0x328
           dio_complete+0xe4/0x1e0
           blk_update_request+0xd4/0x450
           scsi_end_request+0x48/0x1c8
           scsi_io_completion+0x272/0x698
           blk_done_softirq+0xca/0xe8
           __do_softirq+0xc8/0x518
           irq_exit+0xee/0x110
           do_IRQ+0x6a/0x88
           io_int_handler+0x11a/0x25c
           __mutex_unlock_slowpath+0x144/0x1d8
           __mutex_unlock_slowpath+0x140/0x1d8
           kernfs_iop_permission+0x64/0x80
           __inode_permission+0x9e/0xf0
           link_path_walk+0x6e/0x510
           path_lookupat+0xc4/0x1a8
           filename_lookup+0x9c/0x160
           user_path_at_empty+0x5c/0x70
           SyS_readlinkat+0x68/0x140
           system_call+0xd6/0x270
        irq event stamp: 971410
        hardirqs last  enabled at (971409):  migrate_page_move_mapping+0x3ea/0x588
        hardirqs last disabled at (971410):  _raw_spin_lock_irqsave+0x3c/0xb0
        softirqs last  enabled at (970526):  __do_softirq+0x460/0x518
        softirqs last disabled at (970519):  irq_exit+0xee/0x110
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
      	 CPU0
      	 ----
          lock(&(&ctx->completion_lock)->rlock);
          <Interrupt>
            lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
        3 locks held by kcompactd0/151:
         #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at:  aio_migratepage+0x42/0x1e8
         #1:  (&ctx->ring_lock){+.+.+.}, at:  aio_migratepage+0x5a/0x1e8
         #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at:  aio_migratepage+0x156/0x1e8
      
        stack backtrace:
        CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G        W       4.7.0-rc1+ #52
        Call Trace:
          show_trace+0xea/0xf0
          show_stack+0x72/0xf0
          dump_stack+0x9a/0xd8
          print_usage_bug.part.27+0x2d4/0x2e8
          mark_lock+0x17e/0x758
          mark_held_locks+0xa2/0xd0
          trace_hardirqs_on_caller+0x140/0x1c0
          mem_cgroup_migrate+0x266/0x370
          aio_migratepage+0x16a/0x1e8
          move_to_new_page+0xb0/0x260
          migrate_pages+0x8f4/0x9f0
          compact_zone+0x4dc/0xdc8
          kcompactd_do_work+0x1aa/0x358
          kcompactd+0xba/0x2c8
          kthread+0x10a/0x110
          kernel_thread_starter+0x6/0xc
          kernel_thread_starter+0x0/0xc
        INFO: lockdep is turned off.
      
      Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
      Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
      Fixes: 74485cf2 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
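Why disable/enable is unsafe when the caller already has interrupts off, as in d93c4130, can be shown with a toy model of a single CPU's interrupt-enable flag. None of the names below are kernel APIs; `model_irq_*()` and the two `migrate_*()` functions are hypothetical and exist only for this sketch.

```c
#include <stdbool.h>

/* Toy model of one CPU's interrupt-enable flag; not a kernel API. */
static bool irqs_enabled = true;

void model_irq_disable(void) { irqs_enabled = false; }
void model_irq_enable(void)  { irqs_enabled = true; }

/* Remember the previous state, then disable. */
bool model_irq_save(void)
{
    bool flags = irqs_enabled;

    irqs_enabled = false;
    return flags;
}

/* Put back whatever state the caller had. */
void model_irq_restore(bool flags) { irqs_enabled = flags; }

/* Pattern before the fix: unconditionally re-enables on exit. */
void migrate_with_disable_enable(void)
{
    model_irq_disable();
    /* ... statistics update would go here ... */
    model_irq_enable();
}

/* Pattern after the fix: leaves the caller's state untouched. */
void migrate_with_save_restore(void)
{
    bool flags = model_irq_save();
    /* ... statistics update would go here ... */
    model_irq_restore(flags);
}

/*
 * Call `callee` from a context that has interrupts disabled (for
 * example while holding an irq-safe lock) and report whether
 * interrupts were wrongly switched on when it returned.
 */
bool irqs_on_after_call(void (*callee)(void))
{
    bool leaked;

    model_irq_disable();
    callee();
    leaked = irqs_enabled;
    model_irq_enable();
    return leaked;
}
```

In this model the disable/enable variant returns to its caller with interrupts on, which is the state lockdep complains about in the report above; the save/restore variant does not.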
  4. 10 Jun, 2016 1 commit
  5. 04 Jun, 2016 1 commit
    • T
      memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem() · 3a06bb78
      Committed by Tejun Heo
      memcg_offline_kmem() may be called from memcg_free_kmem() after a css
      init failure.  memcg_free_kmem() is a ->css_free callback which is
      called without cgroup_mutex and memcg_offline_kmem() ends up using
      css_for_each_descendant_pre() without any locking.  Fix it by adding rcu
      read locking around it.
      
          mkdir: cannot create directory `65530': No space left on device
          ===============================
          [ INFO: suspicious RCU usage. ]
          4.6.0-work+ #321 Not tainted
          -------------------------------
          kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
           [  527.243970] other info that might help us debug this:
           [  527.244715]
          rcu_scheduler_active = 1, debug_locks = 0
          2 locks held by kworker/0:5/1664:
           #0:  ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           #1:  ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           [  527.248098] stack backtrace:
          CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
          Workqueue: cgroup_destroy css_free_work_fn
          Call Trace:
            dump_stack+0x68/0xa1
            lockdep_rcu_suspicious+0xd7/0x110
            css_next_descendant_pre+0x7d/0xb0
            memcg_offline_kmem.part.44+0x4a/0xc0
            mem_cgroup_css_free+0x1ec/0x200
            css_free_work_fn+0x49/0x5e0
            process_one_work+0x1c5/0x4a0
            worker_thread+0x49/0x490
            kthread+0xea/0x100
            ret_from_fork+0x1f/0x40
      
      Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 28 May, 2016 2 commits
  7. 27 May, 2016 1 commit
  8. 24 May, 2016 1 commit
  9. 21 May, 2016 1 commit
  10. 20 May, 2016 5 commits
    • M
      oom, oom_reaper: try to reap tasks which skip regular OOM killer path · 3ef22dff
      Committed by Michal Hocko
      If either the current task is already killed or PF_EXITING, or a selected
      task is PF_EXITING, then the oom killer is suppressed and so is the oom
      reaper.  This patch adds try_oom_reaper, which checks the given task and
      queues it for the oom reaper if that is safe, meaning that the task
      doesn't share the mm with a live process.
      
      This might help to relieve the memory pressure while the task tries to
      exit.
      
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm: update_lru_size do the __mod_zone_page_state · 9d5e6a9f
      Committed by Hugh Dickins
      Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
      reclaim was removed) that lru_size can be updated by -nr_taken once per
      call to isolate_lru_pages(), instead of page by page.
      
      Update it inside isolate_lru_pages(), or at its two callsites? I chose
      to update it at the callsites, rearranging and grouping the updates by
      nr_taken and nr_scanned together in both.
      
      With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
      __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
      some more calls in a future commit.  Make the code a little smaller and
      simpler by incorporating stat update in lru_size update.
      
      The exception was move_active_pages_to_lru(), which aggregated the
      pgmoved stat update separately from the individual lru_size updates; but
      I still think this is a simplification worth making.
      
      However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
      better to use the name update_lru_size, which calls
      mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • H
      mm: update_lru_size warn and reset bad lru_size · ca707239
      Committed by Hugh Dickins
      Though debug kernels have a VM_BUG_ON to help protect from misaccounting
      lru_size, non-debug kernels are liable to wrap it around: and then the
      vast unsigned long size draws page reclaim into a loop of repeatedly
      doing nothing on an empty list, without even a cond_resched().
      
      That soft lockup looks confusingly like an over-busy reclaim scenario,
      with lots of contention on the lru_lock in shrink_inactive_list(): yet
      has a totally different origin.
      
      Help differentiate with a custom warning in
      mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
      size to avoid the lockup.  But the particular bug which suggested this
      change was mine alone, and since fixed.
      
      Make it a WARN_ONCE: the first occurrence is the most informative, a
      flurry may follow, yet even when rate-limited little more is learnt.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
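The wraparound that ca707239 guards against can be reproduced in user space. The sketch below is not the kernel's mem_cgroup_update_lru_size(); `naive_update()` and `update_size()` are hypothetical helpers, and the `warned` flag is a stand-in that imitates WARN_ONCE by reporting only the first occurrence.

```c
#include <stdbool.h>
#include <stdio.h>

static bool warned;  /* imitates WARN_ONCE: only the first hit is reported */

/* Unchecked update: a negative delta larger than size wraps around. */
unsigned long naive_update(unsigned long size, long delta)
{
    return size + (unsigned long)delta;
}

/*
 * Checked update: if the delta would take the size below zero,
 * warn once and reset the size to zero instead of wrapping.
 */
unsigned long update_size(unsigned long size, long delta)
{
    /* Magnitude of a negative delta, computed without signed overflow. */
    unsigned long magnitude = 0UL - (unsigned long)delta;

    if (delta < 0 && magnitude > size) {
        if (!warned) {
            warned = true;
            fprintf(stderr, "bad size update: size=%lu delta=%ld\n",
                    size, delta);
        }
        return 0;
    }
    return size + (unsigned long)delta;
}
```

With a size of 2 and a delta of -5, `naive_update()` yields a value close to ULONG_MAX, the "vast unsigned long size" that makes reclaim spin on an empty list, while `update_size()` reports the problem once and returns 0.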
    • M
      mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment · fda3d69b
      Committed by Michal Hocko
      > The comment seems to have not much to do with the code?
      
      I guess the comment tries to say that the code path is triggered when we
      charge the page which happens _before_ it is added to the LRU list and
      so last_scanned_node might contain the stale data.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • A
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Committed by Andrew Morton
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
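The wrap-around pattern that 0edaf86c folds into next_node_in() can be modelled in user space with a 64-bit mask standing in for a nodemask. This is a sketch under that assumption: `MAX_NODES` replaces MAX_NUMNODES, `find_next()` is a hypothetical helper, and the functions only mirror the names used in the commit message, not the kernel's macros.

```c
#include <stdint.h>

#define MAX_NODES 64  /* stand-in for MAX_NUMNODES in this model */

/* First set bit at or above `from`, or MAX_NODES if there is none. */
int find_next(uint64_t mask, int from)
{
    for (int n = from; n < MAX_NODES; n++)
        if (mask & (UINT64_C(1) << n))
            return n;
    return MAX_NODES;
}

int first_node(uint64_t mask)
{
    return find_next(mask, 0);
}

int next_node(int node, uint64_t mask)
{
    return find_next(mask, node + 1);
}

/*
 * Next set node after `node`, wrapping to the first set node when
 * the end of the mask is reached. Returns MAX_NODES for an empty mask.
 */
int next_node_in(int node, uint64_t mask)
{
    int ret = next_node(node, mask);

    if (ret == MAX_NODES)
        ret = first_node(mask);
    return ret;
}
```

For a mask with nodes 1, 4, and 9 set (0x212), repeated calls walk 1, 4, 9 and then wrap back to 1, which is the round-robin behaviour the open-coded three-line sequence produced at each call site.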
  11. 26 Apr, 2016 1 commit
    • T
      memcg: relocate charge moving from ->attach to ->post_attach · 264a0ae1
      Committed by Tejun Heo
      Hello,
      
      So, this ended up a lot simpler than I originally expected.  I tested
      it lightly and it seems to work fine.  Petr, can you please test these
      two patches w/o the lru drain drop patch and see whether the problem
      is gone?
      
      Thanks.
      ------ 8< ------
      If charge moving is used, memcg performs relabeling of the affected
      pages from its ->attach callback, which is called under
      cgroup_threadgroup_rwsem and thus can't create new kthreads.  This is
      fragile as various operations may depend on workqueues making forward
      progress, which relies on the ability to create new kthreads.
      
      There's no reason to perform charge moving from ->attach which is deep
      in the task migration path.  Move it to ->post_attach which is called
      after the actual migration is finished and cgroup_threadgroup_rwsem is
      dropped.
      
      * move_charge_struct->mm is added and ->can_attach is now responsible
        for pinning and recording the target mm.  mem_cgroup_clear_mc() is
        updated accordingly.  This also simplifies mem_cgroup_move_task().
      
      * mem_cgroup_move_task() is now called from ->post_attach instead of
        ->attach.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
      Reported-by: Cyril Hrubis <chrubis@suse.cz>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Fixes: 1ed13287 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
      Cc: <stable@vger.kernel.org> # 4.4+
  12. 18 Mar, 2016 12 commits
  13. 16 Mar, 2016 5 commits