1. 08 Dec 2020, 1 commit
  2. 16 Nov 2020, 1 commit
  3. 19 Oct 2020, 13 commits
  4. 12 May 2020, 1 commit
    • vmalloc: fix remap_vmalloc_range() bounds checks · 7b6a9ab1
      Committed by Jann Horn
      commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.
      
      remap_vmalloc_range() has had various issues with the bounds checks it
      promises to perform ("This function checks that addr is a valid
      vmalloc'ed area, and that it is big enough to cover the vma") over time,
      e.g.:
      
       - not detecting pgoff<<PAGE_SHIFT overflow
      
       - not detecting (pgoff<<PAGE_SHIFT)+usize overflow
      
       - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same
         vmalloc allocation
      
       - comparing a potentially wildly out-of-bounds pointer with the end of
         the vmalloc region
      
      In particular, since commit fc9702273e2e ("bpf: Add mmap() support for
      BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer
      dereferences by calling mmap() on a BPF map with a size that is bigger
      than the distance from the start of the BPF map to the end of the
      address space.
      
      This could theoretically be used as a kernel ASLR bypass, by using
      whether mmap() with a given offset oopses or returns an error code to
      perform a binary search over the possible address range.
      
      To allow remap_vmalloc_range_partial() to verify that addr and
      addr+(pgoff<<PAGE_SHIFT) are in the same vmalloc region, pass the offset
      to remap_vmalloc_range_partial() instead of adding it to the pointer in
      remap_vmalloc_range().
      
      In remap_vmalloc_range_partial(), fix the check against
      get_vm_area_size() by using size comparisons instead of pointer
      comparisons, and add checks for pgoff.
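      A minimal sketch of the overflow-safe checks described above, assuming the
      check_*_overflow() helpers from <linux/overflow.h>; check_vmalloc_remap_bounds()
      is a hypothetical wrapper for illustration, not the literal upstream diff:

        #include <linux/mm.h>
        #include <linux/overflow.h>
        #include <linux/vmalloc.h>

        /* Validate (kaddr, pgoff, size) before remapping into userspace. */
        static int check_vmalloc_remap_bounds(void *kaddr, unsigned long pgoff,
                                              unsigned long size)
        {
                unsigned long off, end_index;
                struct vm_struct *area = find_vm_area(kaddr);

                /* pgoff << PAGE_SHIFT must not overflow */
                if (check_shl_overflow(pgoff, PAGE_SHIFT, &off))
                        return -EINVAL;

                /* kaddr must belong to a user-mappable vmalloc area */
                if (!area || !(area->flags & VM_USERMAP))
                        return -EINVAL;

                /* compare sizes, not potentially out-of-bounds pointers */
                if (check_add_overflow(size, off, &end_index) ||
                    end_index > get_vm_area_size(area))
                        return -EINVAL;

                return 0;
        }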
      
      Fixes: 83342314 ("[PATCH] mm: introduce remap_vmalloc_range()")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Yonghong Song <yhs@fb.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: KP Singh <kpsingh@chromium.org>
      Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Conflicts:
        mm/vmalloc.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7b6a9ab1
  5. 26 Apr 2020, 1 commit
  6. 17 Apr 2020, 1 commit
  7. 14 Apr 2020, 5 commits
    • mm/vmalloc.c: fix percpu free VM area search criteria · dc6de49f
      Committed by Kuppuswamy Sathyanarayanan
      mainline inclusion
      from mainline-5.3-rc5
      commit 5336e52c
      category: bugfix
      bugzilla: 15766
      CVE: NA
      
      -------------------------------------------------
      Recent changes to the vmalloc code by commit 68ad4a33
      ("mm/vmalloc.c: keep track of free blocks for vmap allocation") can
      cause spurious percpu allocation failures.  These, in turn, can result
      in panic()s in the slub code.  One such possible panic was reported by
      Dave Hansen in the following link: https://lkml.org/lkml/2019/6/19/939.
      Another related panic observed is:
      
       RIP: 0033:0x7f46f7441b9b
       Call Trace:
        dump_stack+0x61/0x80
        pcpu_alloc.cold.30+0x22/0x4f
        mem_cgroup_css_alloc+0x110/0x650
        cgroup_apply_control_enable+0x133/0x330
        cgroup_mkdir+0x41b/0x500
        kernfs_iop_mkdir+0x5a/0x90
        vfs_mkdir+0x102/0x1b0
        do_mkdirat+0x7d/0xf0
        do_syscall_64+0x5b/0x180
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      VMALLOC memory manager divides the entire VMALLOC space (VMALLOC_START
      to VMALLOC_END) into multiple VM areas (struct vm_areas), and it mainly
      uses two lists (vmap_area_list & free_vmap_area_list) to track the used
      and free VM areas in VMALLOC space.  And pcpu_get_vm_areas(offsets[],
      sizes[], nr_vms, align) function is used for allocating congruent VM
      areas for percpu memory allocator.  In order to not conflict with
      VMALLOC users, pcpu_get_vm_areas allocates VM areas near the end of the
      VMALLOC space.  So the search for free vm_area for the given requirement
      starts near VMALLOC_END and moves upwards towards VMALLOC_START.
      
      Prior to commit 68ad4a33, the search for free vm_area in
      pcpu_get_vm_areas() involves following two main steps.
      
      Step 1:
          Find an aligned "base" address near VMALLOC_END.
          va = free vm area near VMALLOC_END
      Step 2:
          Loop through number of requested vm_areas and check,
              Step 2.1:
                 if (base < VMALLOC_START)
                    1. fail with error
              Step 2.2:
                 // end is offsets[area] + sizes[area]
                 if (base + end > va->vm_end)
                     1. Move the base downwards and repeat Step 2
              Step 2.3:
                 if (base + start < va->vm_start)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      But Commit 68ad4a33 removed Step 2.2 and modified Step 2.3 as below:
      
              Step 2.3:
                 if (base + start < va->vm_start || base + end > va->vm_end)
                    1. Move to previous free vm_area node, find aligned
                       base address and repeat Step 2
      
      The above change is the root cause of the spurious percpu memory allocation
      failures.  For example, consider a case where a relatively large vm_area
      (~30 TB) was ignored in the free vm_area search because it did not pass the
      base + end < va->vm_end boundary check.  Ignoring such large free vm_areas
      leads to not finding a free vm_area within the VMALLOC_START to VMALLOC_END
      boundary, which in turn leads to allocation failures.

      So modify the search algorithm to include Step 2.2, as sketched below.
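      A hedged sketch of the placement loop with Step 2.2 restored; field names
      follow struct vmap_area in the patched code, and prev_free_area() is an
      illustrative helper standing in for the rb-tree/list walk, not an upstream
      function:

        /* Try to place nr_vms congruent areas at *basep, moving it down on conflict. */
        static bool place_percpu_areas(struct vmap_area *va, unsigned long *offsets,
                                       unsigned long *sizes, int nr_vms,
                                       unsigned long align, unsigned long *basep)
        {
                unsigned long base = *basep;
                int area = 0;

                while (area < nr_vms) {
                        unsigned long start = offsets[area];
                        unsigned long end = offsets[area] + sizes[area];

                        /* Step 2.1: fell below the bottom of the vmalloc range */
                        if (base + start < VMALLOC_START)
                                return false;

                        /*
                         * Step 2.2 (restored): candidate spills past this free
                         * area, so move the base downwards and rescan.
                         */
                        if (base + end > va->va_end) {
                                base = ALIGN_DOWN(va->va_end - end, align);
                                area = 0;
                                continue;
                        }

                        /*
                         * Step 2.3: candidate starts before this free area, so
                         * back up to the previous free area and rescan.
                         */
                        if (base + start < va->va_start) {
                                va = prev_free_area(va);        /* illustrative helper */
                                base = ALIGN_DOWN(va->va_end - end, align);
                                area = 0;
                                continue;
                        }

                        area++;         /* this requested area fits; check the next one */
                }

                *basep = base;
                return true;
        }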
      
      Link: http://lkml.kernel.org/r/20190729232139.91131-1-sathyanarayanan.kuppuswamy@linux.intel.com
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: sathyanarayanan kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 5336e52c)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      dc6de49f
    • mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning · 95cdbea7
      Committed by Arnd Bergmann
      mainline inclusion
      from mainline-5.2-rc7
      commit 2c929233
      category: bugfix
      bugzilla: 15766
      CVE: NA
      
      -------------------------------------------------
      gcc gets confused in pcpu_get_vm_areas() because there are too many
      branches that affect whether 'lva' was initialized before it gets used:
      
        mm/vmalloc.c: In function 'pcpu_get_vm_areas':
        mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
            insert_vmap_area_augment(lva, &va->rb_node,
            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
             &free_vmap_area_root, &free_vmap_area_list);
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/vmalloc.c:916:20: note: 'lva' was declared here
          struct vmap_area *lva;
                            ^~~
      
      Add an initialization to NULL, and check whether it has changed before
      the first use; a rough sketch follows.
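      A rough sketch of the fix described above (variable names as in the warning;
      not the literal upstream hunk); the key point is that testing 'lva' itself
      also tells the compiler the pointer is only dereferenced when it was set:

        struct vmap_area *lva = NULL;           /* was left uninitialized */

        if (type == NE_FIT_TYPE) {
                lva = kmem_cache_alloc(vmap_area_cachep, GFP_NOWAIT);
                if (!lva)
                        return -1;

                /* build the remaining (left) part of the split */
                lva->va_start = va->va_start;
                lva->va_end = nva_start_addr;
        }

        /* other fit types handled here */

        if (lva)        /* i.e. type == NE_FIT_TYPE and the allocation succeeded */
                insert_vmap_area_augment(lva, &va->rb_node,
                                &free_vmap_area_root, &free_vmap_area_list);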
      
      [akpm@linux-foundation.org: tweak comments]
      Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
      Fixes: 68ad4a33 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 2c929233)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      95cdbea7
    • mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro · ba00c205
      Committed by Uladzislau Rezki (Sony)
      mainline inclusion
      from mainline-5.2-rc1
      commit a6cf4e0f
      category: bugfix
      bugzilla: 15766
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that vmap allocations happen in
      ascending order.

      By default this option is set to 0 and is not active.  Activating it
      requires recompiling the kernel: set the macro to 1 and rebuild.
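      A hedged sketch of how such a compile-time switch is typically wired up; here
      find_vmap_lowest_linear_match() stands for a plain linear walk of the free
      list that the tree-based lookup is compared against (illustrative, not the
      literal upstream code):

        #define DEBUG_AUGMENT_LOWEST_MATCH_CHECK 0      /* set to 1 and rebuild */

        #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
        static void find_vmap_lowest_match_check(unsigned long size)
        {
                struct vmap_area *va_tree, *va_linear;

                /* tree-based lookup vs. a slow linear reference walk */
                va_tree = find_vmap_lowest_match(size, 1, VMALLOC_START);
                va_linear = find_vmap_lowest_linear_match(size, 1, VMALLOC_START);

                if (va_tree != va_linear)
                        pr_emerg("not lowest: tree: %p, linear: %p, size: %lu\n",
                                 va_tree, va_linear, size);
        }
        #endif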
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-4-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-4-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit a6cf4e0f)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ba00c205
    • mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro · 36faa5cd
      Committed by Uladzislau Rezki (Sony)
      mainline inclusion
      from mainline-5.2-rc1
      commit bb850f4d
      category: bugfix
      bugzilla: 15766
      CVE: NA
      
      -------------------------------------------------
      This macro adds some debug code to check that the augmented tree is
      maintained correctly, meaning that every node contains a valid
      subtree_max_size value.

      By default this option is set to 0 and is not active.  Activating it
      requires recompiling the kernel: set the macro to 1 and rebuild.
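      A hedged sketch of the kind of check this enables: recompute the maximum
      free-block size under each node the slow way and compare it with the cached
      subtree_max_size (illustrative, not the literal upstream code):

        #define DEBUG_AUGMENT_PROPAGATE_CHECK 0         /* set to 1 and rebuild */

        #if DEBUG_AUGMENT_PROPAGATE_CHECK
        /* recompute the maximum free-block size under 'n' without the cache */
        static unsigned long subtree_max_size_slow(struct rb_node *n)
        {
                struct vmap_area *va;

                if (!n)
                        return 0;

                va = rb_entry(n, struct vmap_area, rb_node);
                return max3(va->va_end - va->va_start,
                            subtree_max_size_slow(n->rb_left),
                            subtree_max_size_slow(n->rb_right));
        }

        static void augment_tree_propagate_check(struct rb_node *n)
        {
                struct vmap_area *va;

                if (!n)
                        return;

                va = rb_entry(n, struct vmap_area, rb_node);
                if (va->subtree_max_size != subtree_max_size_slow(n))
                        pr_emerg("tree is corrupted: %lu vs %lu\n",
                                 va->subtree_max_size, subtree_max_size_slow(n));

                augment_tree_propagate_check(n->rb_left);
                augment_tree_propagate_check(n->rb_right);
        }
        #endif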
      
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-3-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190402162531.10888-3-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit bb850f4d)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36faa5cd
    • mm/vmalloc.c: keep track of free blocks for vmap allocation · 1babe76a
      Committed by Uladzislau Rezki (Sony)
      mainline inclusion
      from mainline-5.2-rc1
      commit 68ad4a33
      category: bugfix
      bugzilla: 15766
      CVE: NA
      
      -------------------------------------------------
      Patch series "improve vmap allocation", v3.
      
      Objective
      ---------
      
      Please have a look for the description at:
      
        https://lkml.org/lkml/2018/10/19/786
      
      but let me also summarize it a bit here as well.
      
      The current implementation has O(N) complexity.  Requests with different
      permissive parameters can lead to long allocation times; by "long" I mean
      milliseconds.
      
      Description
      -----------
      
      This approach organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range, i.e.  an allocation is done by looking up free areas
      instead of finding a hole between two busy blocks.  It results in a lower
      number of objects representing the free space, and therefore a less
      fragmented allocator, because free blocks are always kept as large as
      possible.

      It uses an augmented tree where all free areas are sorted in ascending
      order of va->va_start address, paired with a linked list that provides
      O(1) access to prev/next elements.

      Since the tree is augmented, we also maintain the "subtree_max_size" of
      each VA, which reflects the maximum available free block in its left or
      right sub-tree.  Knowing that, we can easily traverse toward the lowest
      (left-most) suitable free area.
      
      Allocation: ~O(log(N)) complexity.  It is a sequential allocation method
      and therefore tends to maximize locality.  The search stops at the first
      suitable block that is large enough to encompass the requested parameters.
      Bigger areas are split.

      I copy-paste here the description of how an area is split, since I
      described it in https://lkml.org/lkml/2018/10/19/786
      
      <snip>
      
      A free block can be split in three different ways.  Their names are
      FL_FIT_TYPE, LE_FIT_TYPE/RE_FIT_TYPE and NE_FIT_TYPE, i.e.  they
      correspond to how the requested size and alignment fit into a free block.

      FL_FIT_TYPE - in this case a free block is simply removed from the free
      list/tree because it fully fits.  Compared with the current design there
      is some extra work updating the rb-tree.

      LE_FIT_TYPE/RE_FIT_TYPE - the left/right edge fits.  In this case we just
      cut the free block.  It is as fast as the current design.  Most vmalloc
      allocations end up in this case, because the edge is always aligned to 1.

      NE_FIT_TYPE - a much less common case.  Basically it happens when the
      requested size and alignment fit neither the left nor the right edge,
      i.e.  the allocation lands between them.  In this case, during splitting,
      we have to build a remaining left free area and place it back into the
      free list/tree.

      Compared with the current design there are two extra steps.  First, we
      have to allocate a new vmap_area structure.  Second, we have to insert
      that remaining free block into the address-sorted list/tree.

      In order to optimize the first case there is a cache of free vmap_area
      objects.  Instead of allocating from the slab we just take an object from
      the cache and reuse it.

      The second case is pretty well optimized.  Since we know a start point in
      the tree, we do not search from the top.  Instead, the traversal begins
      from the rb-tree node we split.
      <snip>
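      As a rough illustration of the split classification just described (a sketch
      following the commit's terminology; the upstream helper may differ in detail):

        enum fit_type {
                NOTHING_FIT, FL_FIT_TYPE, LE_FIT_TYPE, RE_FIT_TYPE, NE_FIT_TYPE
        };

        static enum fit_type classify_va_fit_type(struct vmap_area *va,
                        unsigned long nva_start_addr, unsigned long size)
        {
                /* the new allocation must lie within this free area */
                if (nva_start_addr < va->va_start ||
                    nva_start_addr + size > va->va_end)
                        return NOTHING_FIT;

                if (va->va_start == nva_start_addr) {
                        if (va->va_end == nva_start_addr + size)
                                return FL_FIT_TYPE;     /* fully fits: unlink the block */
                        return LE_FIT_TYPE;             /* left edge fits: move va_start up */
                }
                if (va->va_end == nva_start_addr + size)
                        return RE_FIT_TYPE;             /* right edge fits: move va_end down */

                return NE_FIT_TYPE;                     /* in the middle: split in two */
        }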
      
      De-allocation.  ~O(log(N)) complexity.  An area is not inserted straight
      away into the tree/list; instead we identify the spot first, checking
      whether it can be merged with its neighbors.  The list provides O(1)
      access to prev/next, so it is fast to check.  Summarizing: if merged,
      large coalesced areas are created; if not, the area is just linked,
      making more fragments.

      There is one more thing that I should mention here.  After modification
      of a VA node, its subtree_max_size is updated if it was/is the biggest
      area in its left or right sub-tree.  Apart from that, the value can also
      be propagated back to upper levels to fix the tree.  For more details
      please have a look at the __augment_tree_propagate_from() function and
      its description.
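      A sketch of the augmented value being described (helper names are
      illustrative, but the idea matches the text above):

        static unsigned long va_size(struct vmap_area *va)
        {
                return va->va_end - va->va_start;
        }

        static unsigned long get_subtree_max_size(struct rb_node *node)
        {
                struct vmap_area *va;

                va = rb_entry_safe(node, struct vmap_area, rb_node);
                return va ? va->subtree_max_size : 0;
        }

        /* the biggest free block in the subtree rooted at 'va' */
        static unsigned long compute_subtree_max_size(struct vmap_area *va)
        {
                return max3(va_size(va),
                            get_subtree_max_size(va->rb_node.rb_left),
                            get_subtree_max_size(va->rb_node.rb_right));
        }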
      
      Tests and stressing
      -------------------
      
      I use the "test_vmalloc.sh" test driver, available under
      "tools/testing/selftests/vm/" since the 5.1-rc1 kernel.  Just run "sudo
      ./test_vmalloc.sh" to find out how to deal with it.

      Tested on different platforms including x86_64/i686/ARM64/x86_64_NUMA.
      Regarding the last one, I do not have physical access to a NUMA system,
      therefore I emulated it.  The stressing time is days.

      If you run the test driver in "stress mode", you also need the patch that
      is in Andrew's tree but not in Linux 5.1-rc1.  So, please apply it:

      http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/commit/?id=e0cf7749bade6da318e98e934a24d8b62fab512c

      After massive testing, I have not identified any problems like memory
      leaks, crashes or kernel panics.  I find it stable, but more testing would
      be good.
      
      Performance analysis
      --------------------
      
      I have used two systems to test.  One is an i5-3320M CPU @ 2.60GHz and
      the other is a HiKey960 (arm64) board.  The i5-3320M runs a 4.20 kernel,
      whereas the HiKey960 uses 4.15.  Both systems could run 5.1-rc1 as well,
      but those results were not ready by the time I was writing this.

      Currently the driver consists of 8 tests.  Three of them correspond to
      the different types of splitting (to compare with the default); we have 3
      of those (see above).  The other 5 do allocations under different
      conditions.
      
      a) sudo ./test_vmalloc.sh performance
      
      When the test driver is run in "performance" mode, it runs all available
      tests pinned to the first online CPU with a sequential test execution
      order.  We do this in order to get stable and repeatable results.  Take a
      look at the time difference in "long_busy_list_alloc_test".  It is not
      surprising, because the worst case is O(N).
      
      # i5-3320M
      How many cycles all tests took:
      CPU0=646919905370(default) cycles vs CPU0=193290498550(patched) cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_performance_patched.txt
      
      # Hikey960 8x CPUs
      How many cycles all tests took:
      CPU0=3478683207 cycles vs CPU0=463767978 cycles
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/HiKey960_performance_patched.txt
      
      b) time sudo ./test_vmalloc.sh test_repeat_count=1
      
      With this configuration, all tests are run on all available online CPUs.
      Before running, each CPU shuffles its test execution order, which gives
      random allocation behaviour.  So it is a rough comparison, but it gives
      the picture for sure.
      
      # i5-3320M
      <default>            vs            <patched>
      real    101m22.813s                real    0m56.805s
      user    0m0.011s                   user    0m0.015s
      sys     0m5.076s                   sys     0m0.023s
      
      # See detailed table with results here:
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_default.txt
      ftp://vps418301.ovh.net/incoming/vmap_test_results_v2/i5-3320M_test_repeat_count_1_patched.txt
      
      # Hikey960 8x CPUs
      <default>            vs            <patched>
      real    unknown                    real    4m25.214s
      user    unknown                    user    0m0.011s
      sys     unknown                    sys     0m0.670s
      
      I did not manage to complete this test on the "default Hikey960" kernel
      version.  After 24 hours it was still running, therefore I had to cancel
      it.  That is why real/user/sys are "unknown".
      
      This patch (of 3):
      
      Currently an allocation of a new vmap area is done by iterating over the
      busy list (complexity O(N)) until a suitable hole is found between two
      busy areas.  Therefore each new allocation causes the list to grow.  Due
      to an over-fragmented list and different permissive parameters, an
      allocation can take a long time; on embedded devices, for example, it is
      milliseconds.

      This patch organizes the KVA memory layout into free areas of the
      1-ULONG_MAX range.  It uses an augmented red-black tree that keeps blocks
      sorted by their offsets, paired with a linked list that keeps the free
      space in order of increasing addresses.

      Nodes are augmented with the size of the maximum available free block in
      their left or right sub-tree.  That allows us to decide on a direction
      and traverse toward the block that will fit and will have the lowest
      start address, i.e.  it is sequential allocation.

      Allocation: to allocate a new block, a search is done over the tree until
      the lowest (left-most) suitable block is found that is large enough to
      encompass the requested size, alignment and vstart point.  If the block
      is bigger than the requested size, it is split.  A sketch of the
      lowest-match traversal follows.
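      A hedged sketch of that lowest-match traversal, reusing get_subtree_max_size()
      from the earlier sketch; it is simplified (the real helper also backtracks
      when a chosen subtree turns out not to contain a fitting block), and
      find_lowest_fit() is an illustrative name, not the upstream function:

        static struct vmap_area *find_lowest_fit(unsigned long size,
                        unsigned long align, unsigned long vstart)
        {
                struct rb_node *node = free_vmap_area_root.rb_node;
                unsigned long length = size + align - 1;

                while (node) {
                        struct vmap_area *va;

                        va = rb_entry(node, struct vmap_area, rb_node);

                        /* prefer the left subtree if it can hold the request */
                        if (get_subtree_max_size(node->rb_left) >= length &&
                            vstart < va->va_start) {
                                node = node->rb_left;
                                continue;
                        }

                        /* otherwise try this node itself ... */
                        if (va->va_end - va->va_start >= length &&
                            va->va_end >= vstart + length)
                                return va;

                        /* ... then the right subtree */
                        if (get_subtree_max_size(node->rb_right) >= length) {
                                node = node->rb_right;
                                continue;
                        }

                        break;
                }

                return NULL;    /* no free block can satisfy the request */
        }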
      
      De-allocation: when a busy vmap area is freed it can either be merged or
      inserted into the tree.  The red-black tree allows us to find a spot
      efficiently, while the linked list provides constant-time access to the
      previous and next blocks to check whether merging can be done.  When a
      de-allocated memory chunk is merged, a large coalesced area is created.
      
      Complexity: ~O(log(N))
      
      [urezki@gmail.com: v3]
        Link: http://lkml.kernel.org/r/20190402162531.10888-2-urezki@gmail.com
      [urezki@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20190406183508.25273-2-urezki@gmail.com
      Link: http://lkml.kernel.org/r/20190321190327.11813-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 68ad4a33)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1babe76a
  8. 27 Dec 2019, 4 commits
  9. 18 Aug 2018, 1 commit
  10. 15 Jun 2018, 1 commit
  11. 08 Jun 2018, 3 commits
  12. 16 May 2018, 2 commits
  13. 22 Feb 2018, 1 commit
  14. 14 Oct 2017, 1 commit
    • Revert "vmalloc: back off when the current task is killed" · b8c8a338
      Committed by Johannes Weiner
      This reverts commits 5d17a73a ("vmalloc: back off when the current
      task is killed") and 171012f5 ("mm: don't warn when vmalloc() fails
      due to a fatal signal").
      
      Commit 5d17a73a ("vmalloc: back off when the current task is
      killed") made all vmalloc allocations from a signal-killed task fail.
      We have seen crashes in the tty driver from this, where a killed task
      exiting tries to switch back to N_TTY, fails n_tty_open because of the
      vmalloc failing, and later crashes when dereferencing tty->disc_data.
      
      Arguably, relying on a vmalloc() call to succeed in order to properly
      exit a task is not the most robust way of doing things.  There will be a
      follow-up patch to the tty code to fall back to the N_NULL ldisc.
      
      But the justification to make that vmalloc() call fail like this isn't
      convincing, either.  The patch mentions an OOM victim exhausting the
      memory reserves and thus deadlocking the machine.  But the OOM killer is
      only one, improbable source of fatal signals.  It doesn't make sense to
      fail allocations preemptively with plenty of memory in most cases.
      
      The patch doesn't mention real-life instances where vmalloc sites would
      exhaust memory, which makes it sound more like a theoretical issue to
      begin with.  But just in case, the OOM access to memory reserves has
      been restricted on the allocator side in cd04ae1e ("mm, oom: do not
      rely on TIF_MEMDIE for memory reserves access"), which should take care
      of any theoretical concerns on that front.
      
      Revert this patch, and the follow-up that suppresses the allocation
      warnings when we fail the allocations due to a signal.
      
      Link: http://lkml.kernel.org/r/20171004185906.GB2136@cmpxchg.org
      Fixes:  171012f5 ("mm: don't warn when vmalloc() fails due to a fatal signal")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alan Cox <alan@llwyncelyn.cymru>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8c8a338
  15. 07 Sep 2017, 2 commits
  16. 19 Aug 2017, 1 commit
  17. 13 Jul 2017, 1 commit
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail semantic in
      the page allocator.  This has been true, but only for allocation requests
      larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been ignored for
      smaller sizes.  This is a bit unfortunate because there is no way to
      express the same semantic for those requests, and they are considered too
      important to fail, so they might end up looping in the page allocator
      forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled usage
      of the __GFP_REPEAT flag has been removed for !costly requests, we can
      give the original flag a better name and, more importantly, a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user
      that the allocator will try really hard but there is no promise of
      success.  This works independent of the order and overrides the default
      allocator behavior.  Page allocator users have several levels of
      guarantee vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
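      A hypothetical usage example of the new semantic (not from this patch):
      try hard for a physically contiguous buffer, but accept failure and fall
      back instead of disturbing the system with the OOM killer:

        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *alloc_big_table(size_t size)
        {
                void *p;

                /* try really hard, but fail rather than trigger the OOM killer */
                p = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
                if (!p)
                        p = vmalloc(size);      /* user-defined fallback path */

                return p;
        }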
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04