1. 07 Aug, 2015 1 commit
  2. 05 Aug, 2015 1 commit
    • M
      mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations · ecf5fc6e
      Committed by Michal Hocko
      Nikolay has reported a hang when a memcg reclaim got stuck with the
      following backtrace:
      
      PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
        #0 __schedule at ffffffff815ab152
        #1 schedule at ffffffff815ab76e
        #2 schedule_timeout at ffffffff815ae5e5
        #3 io_schedule_timeout at ffffffff815aad6a
        #4 bit_wait_io at ffffffff815abfc6
        #5 __wait_on_bit at ffffffff815abda5
        #6 wait_on_page_bit at ffffffff8111fd4f
        #7 shrink_page_list at ffffffff81135445
        #8 shrink_inactive_list at ffffffff81135845
        #9 shrink_lruvec at ffffffff81135ead
       #10 shrink_zone at ffffffff811360c3
       #11 shrink_zones at ffffffff81136eff
       #12 do_try_to_free_pages at ffffffff8113712f
       #13 try_to_free_mem_cgroup_pages at ffffffff811372be
       #14 try_charge at ffffffff81189423
       #15 mem_cgroup_try_charge at ffffffff8118c6f5
       #16 __add_to_page_cache_locked at ffffffff8112137d
       #17 add_to_page_cache_lru at ffffffff81121618
       #18 pagecache_get_page at ffffffff8112170b
       #19 grow_dev_page at ffffffff811c8297
       #20 __getblk_slow at ffffffff811c91d6
       #21 __getblk_gfp at ffffffff811c92c1
       #22 ext4_ext_grow_indepth at ffffffff8124565c
       #23 ext4_ext_create_new_leaf at ffffffff81246ca8
       #24 ext4_ext_insert_extent at ffffffff81246f09
       #25 ext4_ext_map_blocks at ffffffff8124a848
       #26 ext4_map_blocks at ffffffff8121a5b7
       #27 mpage_map_one_extent at ffffffff8121b1fa
       #28 mpage_map_and_submit_extent at ffffffff8121f07b
       #29 ext4_writepages at ffffffff8121f6d5
       #30 do_writepages at ffffffff8112c490
       #31 __filemap_fdatawrite_range at ffffffff81120199
       #32 filemap_flush at ffffffff8112041c
       #33 ext4_alloc_da_blocks at ffffffff81219da1
       #34 ext4_rename at ffffffff81229b91
       #35 ext4_rename2 at ffffffff81229e32
       #36 vfs_rename at ffffffff811a08a5
       #37 SYSC_renameat2 at ffffffff811a3ffc
       #38 sys_renameat2 at ffffffff811a408e
       #39 sys_rename at ffffffff8119e51e
       #40 system_call_fastpath at ffffffff815afa89
      
      Dave Chinner has properly pointed out that this is a deadlock in the
      reclaim code because ext4 doesn't submit pages which are marked by
      PG_writeback right away.
      
      The heuristic was introduced by commit e62e384e ("memcg: prevent OOM
      with too many dirty pages") and it was applied only when may_enter_fs
      was specified.  The code has been changed by c3b94f44 ("memcg:
      further prevent OOM with too many dirty pages") which has removed the
      __GFP_FS restriction with a reasoning that we do not get into the fs
      code.  But this is not sufficient apparently because the fs doesn't
      necessarily submit pages marked PG_writeback for IO right away.
      
      ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
      submit the bio.  Instead it tries to map more pages into the bio and
      mpage_map_one_extent might trigger memcg charge which might end up
      waiting on a page which is marked PG_writeback but hasn't been submitted
      yet so we would end up waiting for something that never finishes.
      
      Fix this issue by replacing the __GFP_IO check with a may_enter_fs check
      (for case 2) before we go to wait on the writeback.  The page fault path,
      which is the only path that triggers the memcg OOM killer since 3.12,
      shouldn't require GFP_NOFS, so this should not reintroduce the premature
      OOM killer issue that the heuristic originally addressed.
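
The change to case 2 can be sketched as a small standalone C model. This is an illustration, not the kernel code: the SKETCH_GFP_* bit values and the function names are hypothetical, the may_enter_fs() definition (__GFP_FS, or __GFP_IO for a swap-cache page) is an assumption about how vmscan computes it, and the kswapd case is left out.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration; the kernel's real values differ. */
#define SKETCH_GFP_IO 0x1u
#define SKETCH_GFP_FS 0x2u

/* Assumed definition: reclaim may call into the filesystem if __GFP_FS is
 * set, or if the page is in the swap cache and __GFP_IO is set. */
static bool may_enter_fs(unsigned gfp_mask, bool page_in_swapcache)
{
    return (gfp_mask & SKETCH_GFP_FS) ||
           (page_in_swapcache && (gfp_mask & SKETCH_GFP_IO));
}

/* Before the fix: memcg reclaim waited on a PageReclaim page under
 * writeback whenever __GFP_IO was set. */
static bool waits_before_fix(unsigned gfp_mask, bool global_reclaim,
                             bool page_reclaim)
{
    return !global_reclaim && page_reclaim && (gfp_mask & SKETCH_GFP_IO);
}

/* After the fix: the wait additionally requires may_enter_fs, so a
 * GFP_NOFS caller never blocks on writeback it may itself be holding up. */
static bool waits_after_fix(unsigned gfp_mask, bool global_reclaim,
                            bool page_reclaim, bool page_in_swapcache)
{
    return !global_reclaim && page_reclaim &&
           may_enter_fs(gfp_mask, page_in_swapcache);
}
```

With a GFP_NOFS-style mask (IO allowed, FS not allowed) on a page cache page, waits_before_fix() returns true while waits_after_fix() returns false, which is the case from the backtrace above.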
      
      According to Dave Chinner, xfs has been doing a similar thing since
      2.6.15, so ext4 is not the only affected filesystem.  Moreover, he notes:
      
      : For example: IO completion might require unwritten extent conversion
      : which executes filesystem transactions and GFP_NOFS allocations. The
      : writeback flag on the pages can not be cleared until unwritten
      : extent conversion completes. Hence memory reclaim cannot wait on
      : page writeback to complete in GFP_NOFS context because it is not
      : safe to do so, memcg reclaim or otherwise.
      
      Cc: stable@vger.kernel.org # 3.9+
      [tytso@mit.edu: corrected the control flow]
      Fixes: c3b94f44 ("memcg: further prevent OOM with too many dirty pages")
      Reported-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 18 Jul, 2015 5 commits
  4. 10 Jul, 2015 1 commit
  5. 02 Jul, 2015 3 commits
    • T
      writeback: don't drain bdi_writeback_congested on bdi destruction · a20135ff
      Committed by Tejun Heo
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      This patch fixes the bug by updating bdi destruction to not wait for
      cong's to drain.  A cong is unlinked from bdi->cgwb_congested_tree on
      bdi destruction regardless of its reference count as the bdi may go
      away at any point after destruction.  wb_congested_put() checks whether
      the cong is already unlinked on release.
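
The unlink-on-destruction scheme described above can be sketched in userspace C. This is a simplified model under stated assumptions, not the kernel implementation: all sk_* names are hypothetical, a small array stands in for bdi->cgwb_congested_tree, and the locking the kernel needs around the index is omitted.

```c
#include <stdbool.h>
#include <stdlib.h>

#define SK_SLOTS 4

struct sk_bdi;

struct sk_cong {
    int refcnt;
    int slot;
    struct sk_bdi *bdi;              /* NULL once unlinked from the index */
};

struct sk_bdi {
    struct sk_cong *index[SK_SLOTS]; /* stand-in for cgwb_congested_tree */
};

/* Look up the cong for a slot, creating it if needed, and take a reference. */
static struct sk_cong *sk_cong_get_create(struct sk_bdi *bdi, int slot)
{
    struct sk_cong *cong = bdi->index[slot];

    if (!cong) {
        cong = calloc(1, sizeof(*cong));
        if (!cong)
            return NULL;
        cong->slot = slot;
        cong->bdi = bdi;
        bdi->index[slot] = cong;
    }
    cong->refcnt++;
    return cong;
}

/* Destruction does not wait for references to drain: every cong is
 * unlinked regardless of its reference count. */
static void sk_bdi_destroy(struct sk_bdi *bdi)
{
    for (int i = 0; i < SK_SLOTS; i++) {
        struct sk_cong *cong = bdi->index[i];

        if (cong) {
            cong->bdi = NULL;
            bdi->index[i] = NULL;
        }
    }
}

/* Drop a reference; on the last put, touch the index only if the cong is
 * still linked.  Returns true when the cong was freed. */
static bool sk_cong_put(struct sk_cong *cong)
{
    if (--cong->refcnt > 0)
        return false;
    if (cong->bdi)
        cong->bdi->index[cong->slot] = NULL;
    free(cong);
    return true;
}
```

The point of the sketch is that sk_cong_put() never dereferences a bdi that has already gone away, because destruction cleared the back pointer first.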
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks")
      Tested-by: Jon Christopherson <jon@jons.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • T
      writeback: don't embed root bdi_writeback_congested in bdi_writeback · a13f35e8
      Committed by Tejun Heo
      52ebea74 ("writeback: make backing_dev_info host cgroup-specific
      bdi_writebacks") made bdi (backing_dev_info) host per-cgroup wb's
      (bdi_writeback's).  As the congested state needs to be per-wb and
      referenced from blkcg side and multiple wbs, the patch made all
      non-root cong's (bdi_writeback_congested's) reference counted and
      indexed on bdi.
      
      When a bdi is destroyed, cgwb_bdi_destroy() tries to drain all
      non-root cong's; however, this can hang indefinitely because wb's can
      also be referenced from blkcg_gq's which are destroyed after bdi
      destruction is complete.
      
      To fix the bug, bdi destruction will be updated to not wait for cong's
      to drain, which naturally means that cong's may outlive the associated
      bdi.  This is fine for non-root cong's but is problematic for the root
      cong's which are embedded in their bdi's as they may end up getting
      dereferenced after the containing bdi's are freed.
      
      This patch makes root cong's behave the same as non-root cong's.  They
      are no longer embedded in their bdi's but allocated separately during
      bdi initialization, indexed and reference counted the same way.
      
      * As cong handling is the same for all wb's, wb->congested
        initialization is moved into wb_init().
      
      * When !CONFIG_CGROUP_WRITEBACK, there was no indexing or refcnting.
        bdi->wb_congested is now a pointer pointing to the root cong
        allocated during bdi init and minimal refcnting operations are
        implemented.
      
      * The above makes root wb init paths diverge depending on
        CONFIG_CGROUP_WRITEBACK.  root wb init is moved to cgwb_bdi_init().
      
      This patch in itself shouldn't cause any consequential behavior
      differences but prepares for the actual fix.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jon Christopherson <jon@jons.org>
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=100681
      Tested-by: Jon Christopherson <jon@jons.org>
      
      Added <linux/slab.h> include to backing-dev.h for kfree() definition.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • C
      Add __init attribute to new_kmalloc_cache · ae6f2462
      Committed by Christoph Lameter
      Avoid the warning:
      
        WARNING: mm/built-in.o(.text.unlikely+0xc22): Section mismatch in reference from the function .new_kmalloc_cache() to the variable .init.rodata:kmalloc_info
        The function .new_kmalloc_cache() references
        the variable __initconst kmalloc_info.
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 01 Jul, 2015 13 commits
  7. 30 Jun, 2015 1 commit
  8. 26 Jun, 2015 6 commits
  9. 25 Jun, 2015 9 commits
    • L
      mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc() · 8a8c35fa
      Committed by Larry Finger
      Beginning at commit d52d3997 ("ipv6: Create percpu rt6_info"), the
      following INFO splat is logged:
      
        ===============================
        [ INFO: suspicious RCU usage. ]
        4.1.0-rc7-next-20150612 #1 Not tainted
        -------------------------------
        kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section!
        other info that might help us debug this:
        rcu_scheduler_active = 1, debug_locks = 0
         3 locks held by systemd/1:
         #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40
         #1:  (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540
         #2:  (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540
        stack backtrace:
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1
        Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20   04/17/2014
        Call Trace:
          dump_stack+0x4c/0x6e
          lockdep_rcu_suspicious+0xe7/0x120
          ___might_sleep+0x1d5/0x1f0
          __might_sleep+0x4d/0x90
          kmem_cache_alloc+0x47/0x250
          create_object+0x39/0x2e0
          kmemleak_alloc_percpu+0x61/0xe0
          pcpu_alloc+0x370/0x630
      
      Additional backtrace lines are truncated.  In addition, the above splat
      is followed by several "BUG: sleeping function called from invalid
      context at mm/slub.c:1268" outputs.  As suggested by Martin KaFai Lau,
      these are the clue to the fix.  Routine kmemleak_alloc_percpu() always
      uses GFP_KERNEL for its allocations, whereas it should follow the gfp
      from its callers.
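
The shape of the fix can be sketched as follows. This is a toy model, not the kernel code: the SK_GFP_* values and all sk_* names are hypothetical, and sk_create_object() only records the flags it was given instead of allocating anything.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; only "may sleep" matters for this sketch. */
#define SK_GFP_WAIT   0x1u
#define SK_GFP_KERNEL (SK_GFP_WAIT)
#define SK_GFP_ATOMIC 0x0u

static unsigned sk_last_gfp;

/* Stand-in for create_object(): remember which flags the metadata
 * allocation would have used. */
static void sk_create_object(const void *ptr, size_t size, unsigned gfp)
{
    (void)ptr;
    (void)size;
    sk_last_gfp = gfp;
}

/* Old behaviour: always a sleeping allocation, whatever the caller holds. */
static void sk_kmemleak_alloc_percpu_old(const void *ptr, size_t size)
{
    sk_create_object(ptr, size, SK_GFP_KERNEL);
}

/* New behaviour: the caller's gfp is passed through unchanged. */
static void sk_kmemleak_alloc_percpu_new(const void *ptr, size_t size,
                                         unsigned gfp)
{
    sk_create_object(ptr, size, gfp);
}

static bool sk_last_alloc_may_sleep(void)
{
    return (sk_last_gfp & SK_GFP_WAIT) != 0;
}
```

A caller inside an RCU-bh read-side section would pass a non-sleeping mask, and the tracking allocation then inherits it instead of triggering the splat shown above.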
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm, thp: respect MPOL_PREFERRED policy with non-local node · 0867a57c
      Committed by Vlastimil Babka
      Since commit 077fcf11 ("mm/thp: allocate transparent hugepages on
      local node"), we handle THP allocations on page fault in a special way -
      for non-interleave memory policies, the allocation is only attempted on
      the node local to the current CPU, if the policy's nodemask allows the
      node.
      
      This is motivated by the assumption that THP benefits cannot offset the
      cost of remote accesses, so it's better to fallback to base pages on the
      local node (which might still be available, while huge pages are not due
      to fragmentation) than to allocate huge pages on a remote node.
      
      The nodemask check prevents us from violating e.g.  MPOL_BIND policies
      where the local node is not among the allowed nodes.  However, the
      current implementation can still give surprising results for the
      MPOL_PREFERRED policy when the preferred node is different than the
      current CPU's local node.
      
      In such case we should honor the preferred node and not use the local
      node, which is what this patch does.  If hugepage allocation on the
      preferred node fails, we fall back to base pages and don't try other
      nodes, with the same motivation as is done for the local node hugepage
      allocations.  The patch also moves the MPOL_INTERLEAVE check around to
      simplify the hugepage specific test.
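
The node selection described above can be sketched as a small C function. This is a simplified model, not the kernel code: the sk_* types and names are hypothetical, the nodemask is reduced to a bitmask used only for the bind policy, and a return value of -1 stands for "do not attempt a single-node hugepage allocation, take the ordinary policy path".

```c
enum sk_mode {
    SK_MPOL_DEFAULT,
    SK_MPOL_PREFERRED,
    SK_MPOL_BIND,
    SK_MPOL_INTERLEAVE
};

struct sk_policy {
    enum sk_mode mode;
    int preferred_node;      /* used by SK_MPOL_PREFERRED */
    unsigned long bind_mask; /* used by SK_MPOL_BIND; bit n = node n allowed */
};

/* Pick the one node on which a hugepage allocation is attempted at fault
 * time.  If that attempt fails, the caller falls back to base pages
 * rather than trying other nodes. */
static int sk_thp_target_node(const struct sk_policy *pol, int local_node)
{
    int node = local_node;

    /* Interleave is handled first and never uses the single-node path. */
    if (pol->mode == SK_MPOL_INTERLEAVE)
        return -1;

    /* The fix: honor the preferred node even when it is not local. */
    if (pol->mode == SK_MPOL_PREFERRED)
        node = pol->preferred_node;

    /* The nodemask check keeps bind policies from being violated. */
    if (pol->mode == SK_MPOL_BIND && !(pol->bind_mask & (1UL << node)))
        return -1;

    return node;
}
```

In terms of the measurements that follow, `numactl -p0 -C12` corresponds to a preferred policy with preferred_node 0 and local_node 1, for which the sketch returns node 0.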
      
      The difference can be demonstrated using in-tree transhuge-stress test
      on the following 2-node machine where half memory on one node was
      occupied to show the difference.
      
      > numactl --hardware
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
      node 0 size: 7878 MB
      node 0 free: 3623 MB
      node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
      node 1 size: 8045 MB
      node 1 free: 7818 MB
      node distances:
      node   0   1
        0:  10  21
        1:  21  10
      
      Before the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages
      
      Number of successful THP allocations corresponds to free memory on node 0 in
      the first case and node 1 in the second case, i.e. -p parameter is ignored and
      cpu binding "wins".
      
      After the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -p1 -C0 ./transhuge-stress
      transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages
      
      The -p parameter is respected regardless of cpu binding.
      
      > numactl -C0 ./transhuge-stress
      transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -C12 ./transhuge-stress
      transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages
      
      Without -p parameter, hugepage restriction to CPU-local node works as before.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      tmpfs: truncate prealloc blocks past i_size · afa2db2f
      Committed by Josef Bacik
      One of the rocksdb people noticed that when you do something like this
      
          fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
          pwrite(fd, buf, 5M, 0)
          ftruncate(5M)
      
      on tmpfs, the file would still take up 10M, which led to super fun
      issues because we were getting ENOSPC before we thought we should be
      getting ENOSPC.  This patch fixes the problem, and mirrors what all the
      other fs'es do (and was agreed to be the correct behaviour at LSF).
      
      I tested it locally to make sure it worked properly with the following
      
          xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file
      
      Without the patch we have "Blocks: 20480", with the patch we have the
      correct value of "Blocks: 10240".
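
The two block counts can be reproduced with a toy model of the sequence. This is an illustration only: the sk_* names are hypothetical, the model tracks just the file size and the allocated byte count, and it assumes that before the patch a truncate trimmed preallocation only when it actually shrank the file. Blocks are counted in 512-byte units, as stat reports them.

```c
#include <stdbool.h>

#define SK_MIB (1024LL * 1024LL)

struct sk_file {
    long long size;      /* i_size in bytes */
    long long allocated; /* bytes backed by blocks, including prealloc */
};

/* fallocate with KEEP_SIZE: reserve space without changing the size. */
static void sk_fallocate_keep_size(struct sk_file *f, long long off,
                                   long long len)
{
    if (off + len > f->allocated)
        f->allocated = off + len;
}

static void sk_pwrite(struct sk_file *f, long long len, long long off)
{
    long long end = off + len;

    if (end > f->allocated)
        f->allocated = end;
    if (end > f->size)
        f->size = end;
}

/* Unpatched: trim only when the file shrinks.  Patched: also trim when
 * the new size equals the old one, dropping prealloc past i_size. */
static void sk_ftruncate(struct sk_file *f, long long new_size, bool patched)
{
    bool trim = patched ? new_size <= f->size : new_size < f->size;

    if (trim && f->allocated > new_size)
        f->allocated = new_size;
    f->size = new_size;
}

static long long sk_blocks(const struct sk_file *f)
{
    return f->allocated / 512;
}
```

Running the fallocate 10M, pwrite 5M, truncate 5M sequence through this model gives 20480 blocks without the trim and 10240 with it, matching the values quoted above.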
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Z
      mm/memory hotplug: print the last vmemmap region at the end of hot add memory · c435a390
      Committed by Zhu Guihua
      When hot-adding two nodes in succession, the vmemmap region info gets
      mixed up: the last region of node 2 is printed only when node 3 is
      hot-added, as follows:
      
        Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 2 totalpages: 0
         Built 2 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
          [mem 0x40000000000-0x407ffffffff] page 1G
          [ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
          [ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
        ...
          [ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
          [ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
        Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
         On node 3 totalpages: 0
         Built 3 zonelists in Node order, mobility grouping on.  Total pages: 16090539
         Policy zone: Normal
         init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
          [mem 0x60000000000-0x607ffffffff] page 1G
          [ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 <=== node 2 ???
          [ffffea1800000000-ffffea18001fffff] PMD -> [ffff8a074a600000-ffff8a074a7fffff] on node 3
          [ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
          [ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
        ...
      
      The cause is that the last region was not printed at the end of hot add
      memory, and p_start, p_end and node_start were not reset.  So when memory
      is hot-added to a new node, the code sees a non-contiguous block and
      prints the previous region.  Print the last vmemmap region at the end of
      hot add memory to avoid the confusion.
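
The mechanism is a run-coalescing printer that needs an explicit final flush. A minimal userspace sketch, with hypothetical sk_* names and a simplified output format, and with counters added only so the behaviour can be checked:

```c
#include <stdbool.h>
#include <stdio.h>

struct sk_region_printer {
    unsigned long start, end; /* current run is [start, end) */
    int node;
    bool active;
    int lines_printed;        /* bookkeeping for the sketch only */
    int last_printed_node;
};

static void sk_print_run(struct sk_region_printer *p)
{
    printf(" [%lx-%lx] on node %d\n", p->start, p->end - 1, p->node);
    p->lines_printed++;
    p->last_printed_node = p->node;
}

/* Extend the current run, or print it and start a new one when the chunk
 * is not contiguous or belongs to another node. */
static void sk_add_chunk(struct sk_region_printer *p, unsigned long addr,
                         unsigned long len, int node)
{
    if (!p->active || p->end != addr || p->node != node) {
        if (p->active)
            sk_print_run(p);
        p->start = addr;
        p->node = node;
        p->active = true;
    }
    p->end = addr + len;
}

/* The fix: flush the pending run at the end of each hot add and reset
 * the state, so it cannot leak into the next node's output. */
static void sk_print_last(struct sk_region_printer *p)
{
    if (p->active)
        sk_print_run(p);
    p->active = false;
    p->start = 0;
    p->end = 0;
    p->node = 0;
}
```

Without the sk_print_last() call between two hot adds, the pending node 2 run would be printed only when the first node 3 chunk arrives, which is the confusing line marked in the log above.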
      Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm/mmap.c: optimization of do_mmap_pgoff function · e37609bb
      Committed by Piotr Kwapulinski
      The simple check for a zero-length memory mapping may be performed
      earlier, so that for a zero-length mapping some unnecessary code is not
      executed at all.  This does not make the code less readable and saves
      some CPU cycles.
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan · 93ada579
      Committed by Catalin Marinas
      The kmemleak memory scanning uses finer grained object->lock spinlocks
      primarily to avoid races with the memory block freeing.  However, the
      pointer lookup in the rb tree requires the kmemleak_lock to be held.
      This is currently done in the find_and_get_object() function for each
      pointer-like location read during scanning.  While this allows a low
      latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
      slower.
      
      This patch moves the kmemleak_lock outside the scan_block() loop,
      acquiring/releasing it only once per scanned memory block.  The
      allow_resched logic is moved outside scan_block() and a new
      scan_large_block() function is implemented which splits large blocks in
      MAX_SCAN_SIZE chunks with cond_resched() calls in-between.  A redundant
      (object->flags & OBJECT_NO_SCAN) check is also removed from
      scan_object().
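
The chunking can be sketched in userspace C. This is a model under stated assumptions, not the kernel code: the sk_* names are hypothetical, a counter stands in for taking kmemleak_lock, sched_yield() stands in for cond_resched(), and the per-word pointer lookup is reduced to counting bytes.

```c
#include <sched.h>
#include <stddef.h>

#define SK_MAX_SCAN_SIZE 4096

static int sk_lock_acquisitions;
static size_t sk_bytes_scanned;

/* Scan one block with the lock taken once for the whole block, rather
 * than once per pointer-like location. */
static void sk_scan_block(const unsigned char *start,
                          const unsigned char *end)
{
    sk_lock_acquisitions++; /* stand-in for acquiring kmemleak_lock */
    /* A real scanner would read each pointer-sized word here and look it
     * up in the object tree; the sketch only accounts for the bytes. */
    sk_bytes_scanned += (size_t)(end - start);
    /* the lock would be released here */
}

/* Split a large block into bounded chunks and yield between them, so the
 * lock hold time stays bounded. */
static void sk_scan_large_block(const unsigned char *start,
                                const unsigned char *end)
{
    while (start < end) {
        size_t left = (size_t)(end - start);
        const unsigned char *next =
            start + (left < SK_MAX_SCAN_SIZE ? left : SK_MAX_SCAN_SIZE);

        sk_scan_block(start, next);
        start = next;
        sched_yield(); /* stand-in for cond_resched() */
    }
}
```

For a 10000-byte buffer the sketch takes the lock three times (4096 + 4096 + 1808 bytes), where per-word locking would take it once for every pointer-sized location.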
      
      With this patch, the kmemleak scanning performance is significantly
      improved: at least 50% with lock debugging disabled and over an order of
      magnitude with lock proving enabled (on an arm64 system).
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: kmemleak: avoid deadlock on the kmemleak object insertion error path · 9d5a4c73
      Committed by Catalin Marinas
      While very unlikely (usually kmemleak or sl*b bug), the create_object()
      function in mm/kmemleak.c may fail to insert a newly allocated object into
      the rb tree.  When this happens, kmemleak disables itself and prints
      additional information about the object already found in the rb tree.
      Such printing is done with the parent->lock acquired, however the
      kmemleak_lock is already held.  This is a potential race with the scanning
      thread which acquires object->lock and kmemleak_lock in the opposite order.
      
      This patch removes the locking around the 'parent' object information
      printing.  Such object cannot be freed or removed from object_tree_root
      and object_list since kmemleak_lock is already held.  There is a very
      small risk that some of the object data is being modified on another CPU
      but the only downside is inconsistent information printing.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup() · 5f369f37
      Committed by Catalin Marinas
      The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
      thread to finish via kthread_stop().  Waiting in kthread_stop() while
      scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
      waits to acquire scan_mutex.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • C
      mm: kmemleak: fix delete_object_*() race when called on the same memory block · e781a9ab
      Committed by Catalin Marinas
      Calling delete_object_*() on the same pointer is not a standard use case
      (unless there is a bug in the code calling kmemleak_free()).  However,
      during kmemleak disabling (error or user triggered via /sys), there is a
      potential race between kmemleak_free() calls on a CPU and
      __kmemleak_do_cleanup() on a different CPU.
      
      The current delete_object_*() implementation first performs a look-up
      holding kmemleak_lock, increments the object->use_count and then
      re-acquires kmemleak_lock to remove the object from object_tree_root and
      object_list.
      
      This patch simplifies the delete_object_*() mechanism to both look up
      and remove an object from the object_tree_root and object_list
      atomically (guarded by kmemleak_lock).  This allows safe concurrent
      calls to delete_object_*() on the same pointer without additional
      locking for synchronising the kmemleak_free_enabled flag.
      
      A side effect is a slight improvement in the delete_object_*() performance
      by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
      object->use_count.
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>