1. 02 6月, 2019 7 次提交
  2. 31 5月, 2019 3 次提交
  3. 24 5月, 2019 1 次提交
  4. 21 5月, 2019 3 次提交
  5. 19 5月, 2019 4 次提交
    • M
      mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock · 60fce36a
      Mel Gorman 提交于
      syzbot reported the following error from a tree with a head commit of
      baf76f0c ("slip: make slhc_free() silently accept an error pointer")
      
        BUG: unable to handle kernel paging request at ffffea0003348000
        #PF error: [normal kernel read fault]
        PGD 12c3f9067 P4D 12c3f9067 PUD 12c3f8067 PMD 0
        Oops: 0000 [#1] PREEMPT SMP KASAN
        CPU: 1 PID: 28916 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #89
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:314 [inline]
        RIP: 0010:PageCompound include/linux/page-flags.h:186 [inline]
        RIP: 0010:isolate_freepages_block+0x1c0/0xd40 mm/compaction.c:579
        Code: 01 d8 ff 4d 85 ed 0f 84 ef 07 00 00 e8 29 00 d8 ff 4c 89 e0 83 85 38 ff
        ff ff 01 48 c1 e8 03 42 80 3c 38 00 0f 85 31 0a 00 00 <4d> 8b 2c 24 31 ff 49
        c1 ed 10 41 83 e5 01 44 89 ee e8 3a 01 d8 ff
        RSP: 0018:ffff88802b31eab8 EFLAGS: 00010246
        RAX: 1ffffd4000669000 RBX: 00000000000cd200 RCX: ffffc9000a235000
        RDX: 000000000001ca5e RSI: ffffffff81988cc7 RDI: 0000000000000001
        RBP: ffff88802b31ebd8 R08: ffff88805af700c0 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0003348000
        R13: 0000000000000000 R14: ffff88802b31f030 R15: dffffc0000000000
        FS:  00007f61648dc700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffffea0003348000 CR3: 0000000037c64000 CR4: 00000000001426e0
        Call Trace:
         fast_isolate_around mm/compaction.c:1243 [inline]
         fast_isolate_freepages mm/compaction.c:1418 [inline]
         isolate_freepages mm/compaction.c:1438 [inline]
         compaction_alloc+0x1aee/0x22e0 mm/compaction.c:1550
      
      There is no reproducer and it is difficult to hit -- 1 crash every few
      days.  The issue is very similar to the fix in commit 6b0868c8
      ("mm/compaction.c: correct zone boundary handling when resetting pageblock
      skip hints").  When isolating free pages around a target pageblock, the
      boundary handling is off by one and can stray into the next pageblock.
      Triggering the syzbot error requires that the end of pageblock is section
      or zone aligned, and that the next section is unpopulated.
      
      A more subtle consequence of the bug is that pageblocks were being
      improperly used as migration targets which potentially hurts fragmentation
      avoidance in the long-term one page at a time.
      
      A debugging patch revealed that it's definitely possible to stray outside
      of a pageblock which is not intended.  While syzbot cannot be used to
      verify this patch, it was confirmed that the debugging warning no longer
      triggers with this patch applied.  It has also been confirmed that the THP
      allocation stress tests are not degraded by this patch.
      
      Link: http://lkml.kernel.org/r/20190510182124.GI18914@techsingularity.net
      Fixes: e332f741 ("mm, compaction: be selective about what pageblocks to clear skip hints")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: syzbot+d84c80f9fe26a0f7a734@syzkaller.appspotmail.com
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org> # v5.1+
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60fce36a
    • U
      mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · a6cf4e0f
      Uladzislau Rezki (Sony) 提交于
      This macro adds some debug code to check that vmap allocations are
      happened in ascending order.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a6cf4e0f
    • U
      mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · bb850f4d
      Uladzislau Rezki (Sony) 提交于
      This macro adds some debug code to check that the augment tree is
      maintained correctly, meaning that every node contains valid
      subtree_max_size value.
      
      By default this option is set to 0 and not active.  It requires
      recompilation of the kernel to activate it.  Set to 1, compile the
      kernel.
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb850f4d
    • U
      mm/vmalloc.c: keep track of free blocks for vmap allocation · 68ad4a33
      Uladzislau Rezki (Sony) 提交于
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity. Requests with different
      permissive parameters can lead to long allocation time. When i say
      "long" i mean milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done over free areas lookups,
      instead of finding a hole between two busy blocks.  It allows to have
      lower number of objects which represent the free space, therefore to have
      less fragmented memory allocator.  Because free blocks are always as large
      as possible.
      
      It uses the augment tree where all free areas are sorted in ascending
      order of va->va_start address in pair with linked list that provides
      O(1) access to prev/next elements.
      
      Since the tree is augment, we also maintain the "subtree_max_size" of VA
      that reflects a maximum available free block in its left or right
      sub-tree.  Knowing that, we can easily traversal toward the lowest (left
      most path) free area.
      
      Allocation: ~O(log(N)) complexity.  It is sequential allocation method
      therefore tends to maximize locality.  The search is done until a first
      suitable block is large enough to encompass the requested parameters.
      Bigger areas are split.
      
      I copy paste here the description of how the area is split, since i
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split by three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how requested size and alignment fit to a free block.
      
      FL_FIT_TYPE - in this case a free block is just removed from the free
      list/tree because it fully fits.  Comparing with current design there is
      an extra work with rb-tree updating.
      
      LE_FIT_TYPE/RE_FIT_TYPE - left/right edges fit.  In this case what we do
      is just cutting a free block.  It is as fast as a current design.  Most of
      the vmalloc allocations just end up with this case, because the edge is
      always aligned to 1.
      
      NE_FIT_TYPE - Is much less common case.  Basically it happens when
      requested size and alignment does not fit left nor right edges, i.e.  it
      is between them.  In this case during splitting we have to build a
      remaining left free area and place it back to the free list/tree.
      
      Comparing with current design there are two extra steps.  First one is we
      have to allocate a new vmap_area structure.  Second one we have to insert
      that remaining free block to the address sorted list/tree.
      
      In order to optimize a first case there is a cache with free_vmap objects.
      Instead of allocating from slab we just take an object from the cache and
      reuse it.
      
      Second one is pretty optimized.  Since we know a start point in the tree
      we do not do a search from the top.  Instead a traversal begins from a
      rb-tree node we split.
      <snip>
      
      De-allocation.  ~O(log(N)) complexity.  An area is not inserted straight
      away to the tree/list, instead we identify the spot first, checking if it
      can be merged around neighbors.  The list provides O(1) access to
      prev/next, so it is pretty fast to check it.  Summarizing.  If merged then
      large coalesced areas are created, if not the area is just linked making
      more fragments.
      
      There is one more thing that i should mention here.  After modification of
      VA node, its subtree_max_size is updated if it was/is the biggest area in
      its left or right sub-tree.  Apart of that it can also be populated back
      to upper levels to fix the tree.  For more details please have a look at
      the __augment_tree_propagate_from() function and the description.
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver available under
      "tools/testing/selftests/vm/" since 5.1-rc1 kernel.  Just trigger "sudo
      ./test_vmalloc.sh" to find out how to deal with it.
      
      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding last one, i do not have any physical access to NUMA system,
      therefore i emulated it.  The time of stressing is days.
      
      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:
      
      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c
      
      After massive testing, i have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is i5-3320M CPU @ 2.60GHz and
      another is HiKey960(arm64) board.  i5-3320M runs on 4.20 kernel, whereas
      Hikey960 uses 4.15 kernel.  I have both system which could run on 5.1-rc1
      as well, but the results have not been ready by time i an writing this.
      
      Currently it consist of 8 tests.  There are three of them which correspond
      to different types of splitting(to compare with default).  We have 3
      ones(see above).  Another 5 do allocations in different conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to first online CPU with sequential execution test order.  We
      do it in order to get stable and repeatable results.  Take a look at time
      difference in "long_busy_list_alloc_test".  It is not surprising because
      the worst case is O(N).
      
      # i5-3320M
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      # Hikey960 8x CPUs
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running each CPU shuffles its tests execution order.  It gives
      random allocation behaviour.  So it is rough comparison, but it puts in
      the picture for sure.
      
      # i5-3320M
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      # Hikey960 8x CPUs
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on "default Hikey960" kernel
      version.  After 24 hours it was still running, therefore i had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of the new vmap area is done over busy list
      iteration(complexity O(n)) until a suitable hole is found between two busy
      areas.  Therefore each new allocation causes the list being grown.  Due to
      over fragmented list and different permissive parameters an allocation can
      take a long time.  For example on embedded devices it is milliseconds.
      
      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augment red-black tree that keeps blocks
      sorted by their offsets in pair with linked list keeping the free space in
      order of increasing addresses.
      
      Nodes are augmented with the size of the maximum available free block in
      its left or right sub-tree.  Thus, that allows to take a decision and
      traversal toward the block that will fit and will have the lowest start
      address, i.e.  it is sequential allocation.
      
      Allocation: to allocate a new block a search is done over the tree until a
      suitable lowest(left most) block is large enough to encompass: the
      requested size, alignment and vstart point.  If the block is bigger than
      requested size - it is split.
      
      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted to the tree.  Red-black tree allows efficiently find a spot
      whereas a linked list provides a constant-time access to previous and next
      blocks to check if merging can be done.  In case of merging of
      de-allocated memory chunk a large coalesced area is created.
      
      Complexity: ~O(log(N))
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68ad4a33
  6. 17 5月, 2019 1 次提交
    • Q
      slab: remove /proc/slab_allocators · 7878c231
      Qian Cai 提交于
      It turned out that DEBUG_SLAB_LEAK is still broken even after recent
      recue efforts that when there is a large number of objects like
      kmemleak_object which is normal on a debug kernel,
      
        # grep kmemleak /proc/slabinfo
        kmemleak_object   2243606 3436210 ...
      
      reading /proc/slab_allocators could easily loop forever while processing
      the kmemleak_object cache and any additional freeing or allocating
      objects will trigger a reprocessing. To make a situation worse,
      soft-lockups could easily happen in this sitatuion which will call
      printk() to allocate more kmemleak objects to guarantee an infinite
      loop.
      
      Also, since it seems no one had noticed when it was totally broken
      more than 2-year ago - see the commit fcf88917 ("slab: fix a crash
      by reading /proc/slab_allocators"), probably nobody cares about it
      anymore due to the decline of the SLAB. Just remove it entirely.
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7878c231
  7. 15 5月, 2019 21 次提交