1. 23 2月, 2017 2 次提交
    • T
      slab: use memcg_kmem_cache_wq for slab destruction operations · 17cc4dfe
      Tejun Heo 提交于
      If there's contention on slab_mutex, queueing the per-cache destruction
      work item on the system_wq can unnecessarily create and tie up a lot of
      kworkers.
      
      Rename memcg_kmem_cache_create_wq to memcg_kmem_cache_wq and make it
      global and use that workqueue for the destruction work items too.  While
      at it, convert the workqueue from an unbound workqueue to a per-cpu one
      with concurrency limited to 1.  It's generally preferable to use per-cpu
      workqueues and concurrency limit of 1 is safe enough.
      
      This is suggested by Joonsoo Kim.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-11-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov@tarantool.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17cc4dfe
    • T
      slab: link memcg kmem_caches on their associated memory cgroup · bc2791f8
      Tejun Heo 提交于
      With kmem cgroup support enabled, kmem_caches can be created and
      destroyed frequently and a great number of near empty kmem_caches can
      accumulate if there are a lot of transient cgroups and the system is not
      under memory pressure.  When memory reclaim starts under such
      conditions, it can lead to consecutive deactivation and destruction of
      many kmem_caches, easily hundreds of thousands on moderately large
      systems, exposing scalability issues in the current slab management
      code.  This is one of the patches to address the issue.
      
      While a memcg kmem_cache is listed on its root cache's ->children list,
      there is no direct way to iterate all kmem_caches which are assocaited
      with a memory cgroup.  The only way to iterate them is walking all
      caches while filtering out caches which don't match, which would be most
      of them.
      
      This makes memcg destruction operations O(N^2) where N is the total
      number of slab caches which can be huge.  This combined with the
      synchronous RCU operations can tie up a CPU and affect the whole machine
      for many hours when memory reclaim triggers offlining and destruction of
      the stale memcgs.
      
      This patch adds mem_cgroup->kmem_caches list which goes through
      memcg_cache_params->kmem_caches_node of all kmem_caches which are
      associated with the memcg.  All memcg specific iterations, including
      stat file access, are updated to use the new list instead.
      
      Link: http://lkml.kernel.org/r/20170117235411.9408-6-tj@kernel.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJay Vana <jsvana@fb.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc2791f8
  2. 25 1月, 2017 1 次提交
  3. 11 1月, 2017 1 次提交
    • M
      mm, memcg: fix the active list aging for lowmem requests when memcg is enabled · b4536f0c
      Michal Hocko 提交于
      Nils Holland and Klaus Ethgen have reported unexpected OOM killer
      invocations with 32b kernel starting with 4.8 kernels
      
      	kworker/u4:5 invoked oom-killer: gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
      	kworker/u4:5 cpuset=/ mems_allowed=0
      	CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
      	[...]
      	Mem-Info:
      	active_anon:58685 inactive_anon:90 isolated_anon:0
      	 active_file:274324 inactive_file:281962 isolated_file:0
      	 unevictable:0 dirty:649 writeback:0 unstable:0
      	 slab_reclaimable:40662 slab_unreclaimable:17754
      	 mapped:7382 shmem:202 pagetables:351 bounce:0
      	 free:206736 free_pcp:332 free_cma:0
      	Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? no
      	DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB writepending:96kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
      	lowmem_reserve[]: 0 813 3474 3474
      	Normal free:41332kB min:41368kB low:51708kB high:62048kB active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
      	lowmem_reserve[]: 0 0 21292 21292
      	HighMem free:781660kB min:512kB low:34356kB high:68200kB active_anon:234740kB inactive_anon:360kB active_file:557232kB inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB free_cma:0kB
      
      the oom killer is clearly pre-mature because there there is still a lot
      of page cache in the zone Normal which should satisfy this lowmem
      request.  Further debugging has shown that the reclaim cannot make any
      forward progress because the page cache is hidden in the active list
      which doesn't get rotated because inactive_list_is_low is not memcg
      aware.
      
      The code simply subtracts per-zone highmem counters from the respective
      memcg's lru sizes which doesn't make any sense.  We can simply end up
      always seeing the resulting active and inactive counts 0 and return
      false.  This issue is not limited to 32b kernels but in practice the
      effect on systems without CONFIG_HIGHMEM would be much harder to notice
      because we do not invoke the OOM killer for allocations requests
      targeting < ZONE_NORMAL.
      
      Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
      and subtract per-memcg highmem counts when memcg is enabled.  Introduce
      helper lruvec_zone_lru_size which redirects to either zone counters or
      mem_cgroup_get_zone_lru_size when appropriate.
      
      We are losing empty LRU but non-zero lru size detection introduced by
      ca707239 ("mm: update_lru_size warn and reset bad lru_size") because
      of the inherent zone vs. node discrepancy.
      
      Fixes: f8d1a311 ("mm: consider whether to decivate based on eligible zones inactive ratio")
      Link: http://lkml.kernel.org/r/20170104100825.3729-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NNils Holland <nholland@tisys.org>
      Tested-by: NNils Holland <nholland@tisys.org>
      Reported-by: NKlaus Ethgen <Klaus@Ethgen.de>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4536f0c
  4. 25 12月, 2016 1 次提交
  5. 13 12月, 2016 1 次提交
  6. 10 11月, 2016 1 次提交
  7. 28 10月, 2016 1 次提交
    • J
      mm: memcontrol: do not recurse in direct reclaim · 89a28483
      Johannes Weiner 提交于
      On 4.0, we saw a stack corruption from a page fault entering direct
      memory cgroup reclaim, calling into btrfs_releasepage(), which then
      tried to allocate an extent and recursed back into a kmem charge ad
      nauseam:
      
        [...]
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        memcg_charge_kmem+0x40/0x80
        new_slab+0x2d9/0x5a0
        __slab_alloc+0x2fd/0x44f
        kmem_cache_alloc+0x193/0x1e0
        alloc_extent_state+0x21/0xc0
        __clear_extent_bit+0x2b5/0x400
        try_release_extent_mapping+0x1a3/0x220
        __btrfs_releasepage+0x31/0x70
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        mem_cgroup_try_charge+0x65/0x1c0
        handle_mm_fault+0x117f/0x1510
        __do_page_fault+0x177/0x420
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
      
      On later kernels, kmem charging is opt-in rather than opt-out, and that
      particular kmem allocation in btrfs_releasepage() is no longer being
      charged and won't recurse and overrun the stack anymore.
      
      But it's not impossible for an accounted allocation to happen from the
      memcg direct reclaim context, and we needed to reproduce this crash many
      times before we even got a useful stack trace out of it.
      
      Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
      avoid recursing into any other form of direct reclaim.  Then let
      recursive charges from PF_MEMALLOC contexts bypass the cgroup limit.
      
      Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89a28483
  8. 08 10月, 2016 5 次提交
  9. 20 9月, 2016 1 次提交
    • J
      mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting · db2ba40c
      Johannes Weiner 提交于
      During cgroup2 rollout into production, we started encountering css
      refcount underflows and css access crashes in the memory controller.
      Splitting the heavily shared css reference counter into logical users
      narrowed the imbalance down to the cgroup2 socket memory accounting.
      
      The problem turns out to be the per-cpu charge cache.  Cgroup1 had a
      separate socket counter, but the new cgroup2 socket accounting goes
      through the common charge path that uses a shared per-cpu cache for all
      memory that is being tracked.  Those caches are safe against scheduling
      preemption, but not against interrupts - such as the newly added packet
      receive path.  When cache draining is interrupted by network RX taking
      pages out of the cache, the resuming drain operation will put references
      of in-use pages, thus causing the imbalance.
      
      Disable IRQs during all per-cpu charge cache operations.
      
      Fixes: f7e1cb6e ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
      Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db2ba40c
  10. 27 8月, 2016 1 次提交
  11. 12 8月, 2016 2 次提交
  12. 10 8月, 2016 1 次提交
    • V
      mm: memcontrol: only mark charged pages with PageKmemcg · c4159a75
      Vladimir Davydov 提交于
      To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
      which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
      in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
      with __GFP_ACCOUNT, including those that aren't actually charged to any
      cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
      in case cgroups are not used, we only do that if memcg_kmem_enabled() is
      true.  The latter is set iff there are kmem-enabled memory cgroups
      (online or offline).  The root cgroup is not considered kmem-enabled.
      
      As a result, if a page is allocated with __GFP_ACCOUNT for the root
      cgroup when there are kmem-enabled memory cgroups and is freed after all
      kmem-enabled memory cgroups were removed, e.g.
      
        # no memory cgroups has been created yet, create one
        mkdir /sys/fs/cgroup/memory/test
        # run something allocating pages with __GFP_ACCOUNT, e.g.
        # a program using pipe
        dmesg | tail
        # remove the memory cgroup
        rmdir /sys/fs/cgroup/memory/test
      
      we'll get bad page state bug complaining about page->_mapcount != -1:
      
        BUG: Bad page state in process swapper/0  pfn:1fd945c
        page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
        flags: 0x1000000000000000()
      
      To avoid that, let's mark with PageKmemcg only those pages that are
      actually charged to and hence pin a non-root memory cgroup.
      
      Fixes: 4949148a ("mm: charge/uncharge kmemcg from generic page allocator paths")
      Reported-and-tested-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4159a75
  13. 03 8月, 2016 1 次提交
    • M
      memcg: put soft limit reclaim out of way if the excess tree is empty · d6507ff5
      Michal Hocko 提交于
      We've had a report about soft lockups caused by lock bouncing in the
      soft reclaim path:
      
        BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
        RIP: 0010:[<ffffffff81469798>]  [<ffffffff81469798>] _raw_spin_lock+0x18/0x20
        Call Trace:
          mem_cgroup_soft_limit_reclaim+0x25a/0x280
          shrink_zones+0xed/0x200
          do_try_to_free_pages+0x74/0x320
          try_to_free_pages+0x112/0x180
          __alloc_pages_slowpath+0x3ff/0x820
          __alloc_pages_nodemask+0x1e9/0x200
          alloc_pages_vma+0xe1/0x290
          do_wp_page+0x19f/0x840
          handle_pte_fault+0x1cd/0x230
          do_page_fault+0x1fd/0x4c0
          page_fault+0x25/0x30
      
      There are no memcgs created so there cannot be any in the soft limit
      excess obviously:
      
        [...]
        memory  0       1       1
      
      so all this just seems to be mem_cgroup_largest_soft_limit_node trying
      to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
      excess tree is empty.  This is just pointless wasting of cycles and
      cache line bouncing during heavy parallel reclaim on large machines.
      The particular machine wasn't very healthy and most probably suffering
      from a memory leak which just caused the memory reclaim to trash
      heavily.  But bouncing on the lock certainly didn't help...
      
      Fix this by optimistic lockless check and bail out early if the tree is
      empty.  This is theoretically racy but that shouldn't matter all that
      much.  First of all soft limit is a best effort feature and it is slowly
      getting deprecated and its usage should be really scarce.  Bouncing on a
      lock without a good reason is surely much bigger problem, especially on
      large CPU machines.
      
      Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6507ff5
  14. 29 7月, 2016 8 次提交
  15. 27 7月, 2016 7 次提交
  16. 23 7月, 2016 1 次提交
    • J
      mm: memcontrol: fix cgroup creation failure after many small jobs · 73f576c0
      Johannes Weiner 提交于
      The memory controller has quite a bit of state that usually outlives the
      cgroup and pins its CSS until said state disappears.  At the same time
      it imposes a 16-bit limit on the CSS ID space to economically store IDs
      in the wild.  Consequently, when we use cgroups to contain frequent but
      small and short-lived jobs that leave behind some page cache, we quickly
      run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
      fails with -ENOSPC while there are only a few, or even no user-visible
      cgroups in existence.
      
      Although pinning CSSs past cgroup removal is common, there are only two
      instances that actually need an ID after a cgroup is deleted: cache
      shadow entries and swapout records.
      
      Cache shadow entries reference the ID weakly and can deal with the CSS
      having disappeared when it's looked up later.  They pose no hurdle.
      
      Swap-out records do need to pin the css to hierarchically attribute
      swapins after the cgroup has been deleted; though the only pages that
      remain swapped out after offlining are tmpfs/shmem pages.  And those
      references are under the user's control, so they are manageable.
      
      This patch introduces a private 16-bit memcg ID and switches swap and
      cache shadow entries over to using that.  This ID can then be recycled
      after offlining when the CSS remains pinned only by objects that don't
      specifically need it.
      
      This script demonstrates the problem by faulting one cache page in a new
      cgroup and deleting it again:
      
        set -e
        mkdir -p pages
        for x in `seq 128000`; do
          [ $((x % 1000)) -eq 0 ] && echo $x
          mkdir /cgroup/foo
          echo $$ >/cgroup/foo/cgroup.procs
          echo trex >pages/$x
          echo $$ >/cgroup/cgroup.procs
          rmdir /cgroup/foo
        done
      
      When run on an unpatched kernel, we eventually run out of possible IDs
      even though there are no visible cgroups:
      
        [root@ham ~]# ./cssidstress.sh
        [...]
        65000
        mkdir: cannot create directory '/cgroup/foo': No space left on device
      
      After this patch, the IDs get released upon cgroup destruction and the
      cache and css objects get released once memory reclaim kicks in.
      
      [hannes@cmpxchg.org: init the IDR]
        Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
      Fixes: b2052564 ("mm: memcontrol: continue cache reclaim from offlined groups")
      Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NJohn Garcia <john.garcia@mesosphere.io>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Nikolay Borisov <kernel@kyup.com>
      Cc: <stable@vger.kernel.org>	[3.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73f576c0
  17. 25 6月, 2016 2 次提交
    • T
      memcg: css_alloc should return an ERR_PTR value on error · ea3a9645
      Tejun Heo 提交于
      mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
      expected it to return an ERR_PTR value leading to the following NULL
      deref after a css allocation failure.  Fix it by return
      ERR_PTR(-ENOMEM) instead.  I'll also update cgroup core so that it
      can handle NULL returns.
      
        mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        ...
        Call Trace:
          dump_stack+0x68/0xa1
          warn_alloc_failed+0xd6/0x130
          __alloc_pages_nodemask+0x4c6/0xf20
          alloc_pages_current+0x66/0xe0
          alloc_kmem_pages+0x14/0x80
          kmalloc_order_trace+0x2a/0x1a0
          __kmalloc+0x291/0x310
          memcg_update_all_caches+0x6c/0x130
          mem_cgroup_css_alloc+0x590/0x610
          cgroup_apply_control_enable+0x18b/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        ...
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
        IP:  init_and_link_css+0x37/0x220
        PGD 34b1e067 PUD 3a109067 PMD 0
        Oops: 0002 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
        task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
        RIP: 0010:[<ffffffff810f2ca7>]  [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
        RSP: 0018:ffff8800666d7d90  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
        RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
        RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
        R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
        R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
        FS:  00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
          cgroup_apply_control_enable+0x1ac/0x370
          cgroup_mkdir+0x1de/0x2e0
          kernfs_iop_mkdir+0x55/0x80
          vfs_mkdir+0xb9/0x150
          SyS_mkdir+0x66/0xd0
          do_syscall_64+0x53/0x120
          entry_SYSCALL64_slow_path+0x25/0x25
        Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
        RIP   init_and_link_css+0x37/0x220
         RSP <ffff8800666d7d90>
        CR2: 00000000000000d0
        ---[ end trace a2d8836ae1e852d1 ]---
      
      Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea3a9645
    • T
      memcg: mem_cgroup_migrate() may be called with irq disabled · d93c4130
      Tejun Heo 提交于
      mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
      with irq disabled from migrate_page_copy().  This ends up enabling irq
      while holding a irq context lock triggering the following lockdep
      warning.  Fix it by using irq_save/restore instead.
      
        =================================
        [ INFO: inconsistent lock state ]
        4.7.0-rc1+ #52 Tainted: G        W
        ---------------------------------
        inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
        kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
        {IN-SOFTIRQ-W} state was registered at:
           __lock_acquire+0x5b6/0x1930
           lock_acquire+0xee/0x270
           _raw_spin_lock_irqsave+0x66/0xb0
           aio_complete+0x98/0x328
           dio_complete+0xe4/0x1e0
           blk_update_request+0xd4/0x450
           scsi_end_request+0x48/0x1c8
           scsi_io_completion+0x272/0x698
           blk_done_softirq+0xca/0xe8
           __do_softirq+0xc8/0x518
           irq_exit+0xee/0x110
           do_IRQ+0x6a/0x88
           io_int_handler+0x11a/0x25c
           __mutex_unlock_slowpath+0x144/0x1d8
           __mutex_unlock_slowpath+0x140/0x1d8
           kernfs_iop_permission+0x64/0x80
           __inode_permission+0x9e/0xf0
           link_path_walk+0x6e/0x510
           path_lookupat+0xc4/0x1a8
           filename_lookup+0x9c/0x160
           user_path_at_empty+0x5c/0x70
           SyS_readlinkat+0x68/0x140
           system_call+0xd6/0x270
        irq event stamp: 971410
        hardirqs last  enabled at (971409):  migrate_page_move_mapping+0x3ea/0x588
        hardirqs last disabled at (971410):  _raw_spin_lock_irqsave+0x3c/0xb0
        softirqs last  enabled at (970526):  __do_softirq+0x460/0x518
        softirqs last disabled at (970519):  irq_exit+0xee/0x110
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
      	 CPU0
      	 ----
          lock(&(&ctx->completion_lock)->rlock);
          <Interrupt>
            lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
        3 locks held by kcompactd0/151:
         #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at:  aio_migratepage+0x42/0x1e8
         #1:  (&ctx->ring_lock){+.+.+.}, at:  aio_migratepage+0x5a/0x1e8
         #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at:  aio_migratepage+0x156/0x1e8
      
        stack backtrace:
        CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G        W       4.7.0-rc1+ #52
        Call Trace:
          show_trace+0xea/0xf0
          show_stack+0x72/0xf0
          dump_stack+0x9a/0xd8
          print_usage_bug.part.27+0x2d4/0x2e8
          mark_lock+0x17e/0x758
          mark_held_locks+0xa2/0xd0
          trace_hardirqs_on_caller+0x140/0x1c0
          mem_cgroup_migrate+0x266/0x370
          aio_migratepage+0x16a/0x1e8
          move_to_new_page+0xb0/0x260
          migrate_pages+0x8f4/0x9f0
          compact_zone+0x4dc/0xdc8
          kcompactd_do_work+0x1aa/0x358
          kcompactd+0xba/0x2c8
          kthread+0x10a/0x110
          kernel_thread_starter+0x6/0xc
          kernel_thread_starter+0x0/0xc
        INFO: lockdep is turned off.
      
      Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
      Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
      Fixes: 74485cf2 ("mm: migrate: consolidate mem_cgroup_migrate() calls")
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d93c4130
  18. 10 6月, 2016 1 次提交
  19. 04 6月, 2016 1 次提交
    • T
      memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem() · 3a06bb78
      Tejun Heo 提交于
      memcg_offline_kmem() may be called from memcg_free_kmem() after a css
      init failure.  memcg_free_kmem() is a ->css_free callback which is
      called without cgroup_mutex and memcg_offline_kmem() ends up using
      css_for_each_descendant_pre() without any locking.  Fix it by adding rcu
      read locking around it.
      
          mkdir: cannot create directory `65530': No space left on device
          ===============================
          [ INFO: suspicious RCU usage. ]
          4.6.0-work+ #321 Not tainted
          -------------------------------
          kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
           [  527.243970] other info that might help us debug this:
           [  527.244715]
          rcu_scheduler_active = 1, debug_locks = 0
          2 locks held by kworker/0:5/1664:
           #0:  ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           #1:  ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
           [  527.248098] stack backtrace:
          CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
          Workqueue: cgroup_destroy css_free_work_fn
          Call Trace:
            dump_stack+0x68/0xa1
            lockdep_rcu_suspicious+0xd7/0x110
            css_next_descendant_pre+0x7d/0xb0
            memcg_offline_kmem.part.44+0x4a/0xc0
            mem_cgroup_css_free+0x1ec/0x200
            css_free_work_fn+0x49/0x5e0
            process_one_work+0x1c5/0x4a0
            worker_thread+0x49/0x490
            kthread+0xea/0x100
            ret_from_fork+0x1f/0x40
      
      Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.orgSigned-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a06bb78
  20. 28 5月, 2016 1 次提交