1. 10 Jun 2016 · 1 commit
  2. 04 Jun 2016 · 1 commit
    • memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem() · 3a06bb78
      Tejun Heo committed
      memcg_offline_kmem() may be called from memcg_free_kmem() after a css
      init failure.  memcg_free_kmem() is a ->css_free callback which is
      called without cgroup_mutex and memcg_offline_kmem() ends up using
      css_for_each_descendant_pre() without any locking.  Fix it by adding rcu
      read locking around it.
      
          mkdir: cannot create directory `65530': No space left on device
          ===============================
          [ INFO: suspicious RCU usage. ]
          4.6.0-work+ #321 Not tainted
          -------------------------------
          kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
           [  527.243970] other info that might help us debug this:
           [  527.244715]
          rcu_scheduler_active = 1, debug_locks = 0
          2 locks held by kworker/0:5/1664:
           #0:  ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           #1:  ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           [  527.248098] stack backtrace:
          CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
          Workqueue: cgroup_destroy css_free_work_fn
          Call Trace:
            dump_stack+0x68/0xa1
            lockdep_rcu_suspicious+0xd7/0x110
            css_next_descendant_pre+0x7d/0xb0
            memcg_offline_kmem.part.44+0x4a/0xc0
            mem_cgroup_css_free+0x1ec/0x200
            css_free_work_fn+0x49/0x5e0
            process_one_work+0x1c5/0x4a0
            worker_thread+0x49/0x490
            kthread+0xea/0x100
            ret_from_fork+0x1f/0x40
      
      Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
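The fix above brackets a pre-order descendant walk with an RCU read-side critical section so that the lockdep check in the iterator is satisfied. The sketch below is a hypothetical userspace model, not kernel code: `model_rcu_read_lock()`, `struct node`, `next_descendant_pre()` and `walk_subtree()` are illustrative names, and a plain counter stands in for the RCU nesting depth that lockdep inspects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a counter stands in for the RCU read-side depth. */
static int rcu_depth;

static void model_rcu_read_lock(void)   { rcu_depth++; }
static void model_rcu_read_unlock(void) { rcu_depth--; }

/* Minimal css-like tree node (hypothetical layout). */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
	int id;
};

/*
 * Pre-order successor of pos inside the subtree rooted at root, in the
 * spirit of css_next_descendant_pre(). pos == NULL starts at root.
 * The assert plays the role of the "RCU read lock required" splat.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	assert(rcu_depth > 0 && "RCU read lock required");
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Walk root and its descendants under the read lock, mirroring the fix. */
static size_t walk_subtree(struct node *root, int *out, size_t max)
{
	size_t n = 0;
	struct node *pos;

	model_rcu_read_lock();
	for (pos = next_descendant_pre(NULL, root); pos;
	     pos = next_descendant_pre(pos, root))
		if (n < max)
			out[n++] = pos->id;
	model_rcu_read_unlock();
	return n;
}
```

Calling `next_descendant_pre()` without the surrounding lock/unlock pair trips the assert, which is the userspace analogue of the "suspicious RCU usage" report in the log.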
  3. 28 May 2016 · 2 commits
  4. 27 May 2016 · 1 commit
  5. 24 May 2016 · 1 commit
  6. 21 May 2016 · 1 commit
  7. 20 May 2016 · 5 commits
    • oom, oom_reaper: try to reap tasks which skip regular OOM killer path · 3ef22dff
      Michal Hocko committed
      If the current task has already been killed or is PF_EXITING, or if a
      selected task is PF_EXITING, the oom killer is suppressed, and so is the
      oom reaper.  This patch adds try_oom_reaper(), which checks the given
      task and queues it for the oom reaper if it is safe to do so, meaning
      that the task doesn't share its mm with a live process.
      
      This might help relieve memory pressure while the task tries to exit.
      
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
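The safety condition described above ("the task doesn't share its mm with a live process") can be modelled as a scan over all tasks. This is a simplified, hypothetical sketch: `struct task`, its `mm_id` and `exiting` fields, and `can_queue_for_reaper()` are illustrative stand-ins, and the real try_oom_reaper() performs further checks that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task model; not the kernel's task_struct. */
struct task {
	int mm_id;      /* which address space the task uses */
	bool exiting;   /* stands in for "already killed or PF_EXITING" */
};

/*
 * Return true if queueing tsk for the reaper looks safe in this model:
 * every other task that shares tsk's mm is itself exiting, so no live
 * process still depends on that address space.
 */
static bool can_queue_for_reaper(const struct task *tsk,
				 const struct task *all, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		const struct task *p = &all[i];

		if (p == tsk || p->mm_id != tsk->mm_id)
			continue;
		if (!p->exiting)
			return false; /* a live process shares the mm */
	}
	return true;
}
```

The scan is linear in the number of tasks, which is acceptable here because the check only runs on the rare OOM path.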
    • mm: update_lru_size do the __mod_zone_page_state · 9d5e6a9f
      Hugh Dickins committed
      Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
      reclaim was removed) that lru_size can be updated by -nr_taken once per
      call to isolate_lru_pages(), instead of page by page.
      
      Update it inside isolate_lru_pages(), or at its two callsites? I chose
      to update it at the callsites, rearranging and grouping the updates by
      nr_taken and nr_scanned together in both.
      
      With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
      __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
      some more calls in a future commit.  Make the code a little smaller and
      simpler by incorporating stat update in lru_size update.
      
      The exception was move_active_pages_to_lru(), which aggregated the
      pgmoved stat update separately from the individual lru_size updates; but
      I still think this is a simplification worth making.
      
      However, the __mod_zone_page_state is not peculiar to mem_cgroups, so
      it is better to use the name update_lru_size, which calls
      mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
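The commit folds the zone statistic update into a single lru_size helper and adjusts the counters once by -nr_taken at each call site instead of page by page. The sketch below is a hypothetical userspace model of that shape: the `model_` functions, the two-entry `enum lru_list`, and the plain `long` counters are illustrative, not the kernel's definitions.

```c
#include <assert.h>

#define CONFIG_MEMCG 1 /* remove to model a !CONFIG_MEMCG build */

enum lru_list { LRU_INACTIVE, LRU_ACTIVE, NR_LRU_LISTS };

/* Hypothetical counters for the zone vmstat and the per-memcg lru_size. */
static long zone_nr_lru[NR_LRU_LISTS];
static long memcg_lru_size[NR_LRU_LISTS];

#ifdef CONFIG_MEMCG
static void model_mem_cgroup_update_lru_size(enum lru_list lru, int nr_pages)
{
	memcg_lru_size[lru] += nr_pages;
}
#endif

/* One helper updates the zone statistic and, with memcg, the lru_size. */
static void model_update_lru_size(enum lru_list lru, int nr_pages)
{
	zone_nr_lru[lru] += nr_pages;
#ifdef CONFIG_MEMCG
	model_mem_cgroup_update_lru_size(lru, nr_pages);
#endif
}

/*
 * "Isolate" up to nr_to_scan of the on_list pages, then account for them
 * with a single -nr_taken update at the call site.
 */
static int model_isolate_and_account(enum lru_list lru, int on_list,
				     int nr_to_scan)
{
	int nr_taken = on_list < nr_to_scan ? on_list : nr_to_scan;

	model_update_lru_size(lru, -nr_taken);
	return nr_taken;
}
```

Keeping the `#ifdef` inside one helper means callers never need to know whether memcg is configured.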
    • mm: update_lru_size warn and reset bad lru_size · ca707239
      Hugh Dickins committed
      Though debug kernels have a VM_BUG_ON to help protect from misaccounting
      lru_size, non-debug kernels are liable to wrap it around: and then the
      vast unsigned long size draws page reclaim into a loop of repeatedly
      doing nothing on an empty list, without even a cond_resched().
      
      That soft lockup looks confusingly like an over-busy reclaim scenario,
      with lots of contention on the lru_lock in shrink_inactive_list(): yet
      has a totally different origin.
      
      Help differentiate with a custom warning in
      mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
      size to avoid the lockup.  But the particular bug which suggested this
      change was mine alone, and since fixed.
      
      Make it a WARN_ONCE: the first occurrence is the most informative, a
      flurry may follow, yet even when rate-limited little more is learnt.

      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
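The idea above is to detect an lru_size that would drop below zero (wrapping an unsigned counter to a vast value), warn only on the first occurrence, and reset the size to zero so reclaim does not spin on an empty list. This hypothetical sketch uses a signed `long` to make the underflow visible and a static flag to approximate WARN_ONCE(); `model_update_lru_size_checked()` is an illustrative name, not the kernel function.

```c
#include <assert.h>
#include <stdio.h>

/* Approximates WARN_ONCE(): report only the first occurrence. */
static int warned;

/*
 * Apply nr_pages to *lru_size. If the result would be negative (an
 * unsigned counter would wrap instead), warn once and reset to zero.
 * Returns 1 when a reset happened, 0 otherwise.
 */
static int model_update_lru_size_checked(long *lru_size, int nr_pages)
{
	long size = *lru_size + nr_pages;

	if (size < 0) {
		if (!warned) {
			warned = 1;
			fprintf(stderr, "lru_size %ld after %+d: resetting to 0\n",
				size, nr_pages);
		}
		*lru_size = 0;
		return 1;
	}
	*lru_size = size;
	return 0;
}
```

Later underflows still reset the counter but stay silent, matching the commit's reasoning that the first report is the most informative.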
    • mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment · fda3d69b
      Michal Hocko committed
      > The comment seems to have not much to do with the code?
      
      I guess the comment tries to say that this code path is triggered when
      we charge the page, which happens _before_ it is added to the LRU list,
      so last_scanned_node might contain stale data.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton committed
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
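The wrap-around pattern quoted in the commit message (next_node(), then fall back to first_node() on MAX_NUMNODES) can be folded into one helper. This is a hypothetical userspace sketch: the nodemask is modelled as a single 64-bit word and the `model_` names are illustrative, whereas the kernel uses nodemask_t and bitmap search helpers.

```c
#include <assert.h>

/* Hypothetical model: one 64-bit word instead of the kernel's nodemask_t. */
#define MODEL_MAX_NUMNODES 64

typedef unsigned long long model_nodemask_t;

/* Next set node strictly after n, or MODEL_MAX_NUMNODES if there is none. */
static int model_next_node(int n, model_nodemask_t mask)
{
	for (int i = n + 1; i < MODEL_MAX_NUMNODES; i++)
		if (mask & (1ULL << i))
			return i;
	return MODEL_MAX_NUMNODES;
}

static int model_first_node(model_nodemask_t mask)
{
	return model_next_node(-1, mask);
}

/*
 * Next set node after n, wrapping to the first set node. Returns
 * MODEL_MAX_NUMNODES only when the mask is empty.
 */
static int model_next_node_in(int n, model_nodemask_t mask)
{
	int ret = model_next_node(n, mask);

	if (ret == MODEL_MAX_NUMNODES)
		ret = model_first_node(mask);
	return ret;
}
```

For a mask with nodes 1 and 3 set, starting from node 3 wraps back to node 1, which is the round-robin behaviour the open-coded callers wanted.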
  8. 26 Apr 2016 · 1 commit
    • memcg: relocate charge moving from ->attach to ->post_attach · 264a0ae1
      Tejun Heo committed
      Hello,
      
      So, this ended up a lot simpler than I originally expected.  I tested
      it lightly and it seems to work fine.  Petr, can you please test these
      two patches w/o the lru drain drop patch and see whether the problem
      is gone?
      
      Thanks.
      ------ 8< ------
      If charge moving is used, memcg performs relabeling of the affected
      pages from its ->attach callback, which is called under
      cgroup_threadgroup_rwsem and thus can't create new kthreads.  This is
      fragile as various operations may depend on workqueues making forward
      progress, which relies on the ability to create new kthreads.
      
      There's no reason to perform charge moving from ->attach which is deep
      in the task migration path.  Move it to ->post_attach which is called
      after the actual migration is finished and cgroup_threadgroup_rwsem is
      dropped.
      
      * move_charge_struct->mm is added and ->can_attach is now responsible
        for pinning and recording the target mm.  mem_cgroup_clear_mc() is
        updated accordingly.  This also simplifies mem_cgroup_move_task().
      
      * mem_cgroup_move_task() is now called from ->post_attach instead of
        ->attach.

      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
      Reported-by: Cyril Hrubis <chrubis@suse.cz>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Fixes: 1ed13287 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
      Cc: <stable@vger.kernel.org> # 4.4+
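The reordering described above can be pictured as a callback sequence: ->can_attach pins and records the target mm, the migration runs under the rwsem without moving charges, and ->post_attach does the moving after the rwsem is dropped, then unpins. This is a hypothetical userspace model of that ordering only; `struct model_mm`, `mc`, and the `model_` callbacks are illustrative and are not the cgroup API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the target mm and move_charge_struct. */
struct model_mm { int refcount; int pages_moved; };
struct model_move_charge { struct model_mm *mm; };

static bool threadgroup_rwsem_held;
static struct model_move_charge mc;

/* ->can_attach: pin and record the target mm. */
static void model_can_attach(struct model_mm *mm)
{
	mm->refcount++;
	mc.mm = mm;
}

/*
 * ->post_attach: runs after the rwsem is dropped, so work that may need
 * new kthreads is allowed here (the assert checks that ordering); then
 * clear the recorded mm and drop the pin.
 */
static void model_post_attach(int nr_pages)
{
	if (!mc.mm)
		return;
	assert(!threadgroup_rwsem_held);
	mc.mm->pages_moved += nr_pages;
	mc.mm->refcount--;
	mc.mm = NULL;
}

/* One migration: can_attach and migrate under the rwsem, then post_attach. */
static void model_migrate(struct model_mm *mm, int nr_pages)
{
	threadgroup_rwsem_held = true;
	model_can_attach(mm);
	/* ... task migration happens here; no charge moving ... */
	threadgroup_rwsem_held = false;
	model_post_attach(nr_pages);
}
```

Pinning the mm in ->can_attach keeps it valid across the gap between dropping the rwsem and running ->post_attach.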
  9. 18 Mar 2016 · 12 commits
  10. 16 Mar 2016 · 5 commits
  11. 22 Jan 2016 · 1 commit
  12. 21 Jan 2016 · 9 commits