1. 03 7月, 2019 4 次提交
  2. 22 6月, 2019 1 次提交
    • A
      coredump: fix race condition between collapse_huge_page() and core dumping · 465ce9a5
      Andrea Arcangeli 提交于
      commit 59ea6d06cfa9247b586a695c21f94afa7183af74 upstream.
      
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page() after taking the mmap_sem for writing doesn't
      modify any vma, so it's not obvious that it could cause a problem to the
      coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically the invariant that "!pmd_trans_huge()" cannot become a
      "pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without mmap_sem for reading,
      which eventually can invoke a lockless page fault which will need a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      465ce9a5
  3. 19 6月, 2019 2 次提交
    • M
      mm/vmscan.c: fix trying to reclaim unevictable LRU page · 54a20289
      Minchan Kim 提交于
      commit a58f2cef26e1ca44182c8b22f4f4395e702a5795 upstream.
      
      There was the below bug report from Wu Fangsuo.
      
      On the CMA allocation path, isolate_migratepages_range() could isolate
      unevictable LRU pages and reclaim_clean_page_from_list() can try to
      reclaim them if they are clean file-backed pages.
      
        page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
        flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
        raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
        raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
        page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
        page->mem_cgroup:ffffffc09bf3ee80
        ------------[ cut here ]------------
        kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
        Hardware name: ASR AQUILAC EVB (DT)
        task: ffffffc00a54cd00 task.stack: ffffffc009b00000
        PC is at shrink_page_list+0x1998/0x3240
        LR is at shrink_page_list+0x1998/0x3240
        pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
        sp : ffffffc009b05940
        ..
           shrink_page_list+0x1998/0x3240
           reclaim_clean_pages_from_list+0x3c0/0x4f0
           alloc_contig_range+0x3bc/0x650
           cma_alloc+0x214/0x668
           ion_cma_allocate+0x98/0x1d8
           ion_alloc+0x200/0x7e0
           ion_ioctl+0x18c/0x378
           do_vfs_ioctl+0x17c/0x1780
           SyS_ioctl+0xac/0xc0
      
      Wu found it's due to commit ad6b6704 ("mm: remove SWAP_MLOCK in
      ttu").  Before that, unevictable pages go to cull_mlocked so that we
      can't reach the VM_BUG_ON_PAGE line.
      
      To fix the issue, this patch filters out unevictable LRU pages from the
      reclaim_clean_pages_from_list in CMA.
      
      Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
      Fixes: ad6b6704 ("mm: remove SWAP_MLOCK in ttu")
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NWu Fangsuo <fangsuowu@asrmicro.com>
      Debugged-by: NWu Fangsuo <fangsuowu@asrmicro.com>
      Tested-by: NWu Fangsuo <fangsuowu@asrmicro.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      54a20289
    • S
      mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node · 553a1f0d
      Shakeel Butt 提交于
      commit 3510955b327176fd4cbab5baa75b449f077722a2 upstream.
      
      Syzbot reported following memory leak:
      
      ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
      BUG: memory leak
      unreferenced object 0xffff888114f26040 (size 32):
        comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
        hex dump (first 32 bytes):
          40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff  @`......@`......
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           slab_post_alloc_hook mm/slab.h:439 [inline]
           slab_alloc mm/slab.c:3326 [inline]
           kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
           kmalloc include/linux/slab.h:547 [inline]
           __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
           memcg_init_list_lru_node mm/list_lru.c:375 [inline]
           memcg_init_list_lru mm/list_lru.c:459 [inline]
           __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
           alloc_super+0x2e0/0x310 fs/super.c:269
           sget_userns+0x94/0x2a0 fs/super.c:609
           sget+0x8d/0xb0 fs/super.c:660
           mount_nodev+0x31/0xb0 fs/super.c:1387
           fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
           legacy_get_tree+0x27/0x80 fs/fs_context.c:661
           vfs_get_tree+0x2e/0x120 fs/super.c:1476
           do_new_mount fs/namespace.c:2790 [inline]
           do_mount+0x932/0xc50 fs/namespace.c:3110
           ksys_mount+0xab/0x120 fs/namespace.c:3319
           __do_sys_mount fs/namespace.c:3333 [inline]
           __se_sys_mount fs/namespace.c:3330 [inline]
           __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
           do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a simple off by one bug on the error path.
      
      Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      553a1f0d
  4. 15 6月, 2019 10 次提交
    • D
      percpu: do not search past bitmap when allocating an area · 526972e9
      Dennis Zhou 提交于
      [ Upstream commit 8c43004af01635cc9fbb11031d070e5e0d327ef2 ]
      
      pcpu_find_block_fit() guarantees that a fit is found within
      PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
      it compares against the block's contig_hint. This can lead to
      incorrectly scanning past the end of the bitmap. The behavior was okay
      given the check after for bit_off >= end and the correctness of the
      hints from pcpu_find_block_fit().
      
      This patch fixes this by bounding the end offset by the number of bits
      in a chunk.
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Reviewed-by: NPeng Fan <peng.fan@nxp.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      526972e9
    • J
      percpu: remove spurious lock dependency between percpu and sched · 5329dcaf
      John Sperbeck 提交于
      [ Upstream commit 198790d9a3aeaef5792d33a560020861126edc22 ]
      
      In free_percpu() we sometimes call pcpu_schedule_balance_work() to
      queue a work item (which does a wakeup) while holding pcpu_lock.
      This creates an unnecessary lock dependency between pcpu_lock and
      the scheduler's pi_lock.  There are other places where we call
      pcpu_schedule_balance_work() without hold pcpu_lock, and this case
      doesn't need to be different.
      
      Moving the call outside the lock prevents the following lockdep splat
      when running tools/testing/selftests/bpf/{test_maps,test_progs} in
      sequence with lockdep enabled:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.1.0-dbg-DEV #1 Not tainted
      ------------------------------------------------------
      kworker/23:255/18872 is trying to acquire lock:
      000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520
      
      but task is already holding lock:
      00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (pcpu_lock){..-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             pcpu_alloc+0xfa/0x780
             __alloc_percpu_gfp+0x12/0x20
             alloc_htab_elem+0x184/0x2b0
             __htab_percpu_map_update_elem+0x252/0x290
             bpf_percpu_hash_update+0x7c/0x130
             __do_sys_bpf+0x1912/0x1be0
             __x64_sys_bpf+0x1a/0x20
             do_syscall_64+0x59/0x400
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #3 (&htab->buckets[i].lock){....}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             htab_map_update_elem+0x1af/0x3a0
      
      -> #2 (&rq->lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             task_fork_fair+0x37/0x160
             sched_fork+0x211/0x310
             copy_process.part.43+0x7b1/0x2160
             _do_fork+0xda/0x6b0
             kernel_thread+0x29/0x30
             rest_init+0x22/0x260
             arch_call_rest_init+0xe/0x10
             start_kernel+0x4fd/0x520
             x86_64_start_reservations+0x24/0x26
             x86_64_start_kernel+0x6f/0x72
             secondary_startup_64+0xa4/0xb0
      
      -> #1 (&p->pi_lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             try_to_wake_up+0x41/0x600
             wake_up_process+0x15/0x20
             create_worker+0x16b/0x1e0
             workqueue_init+0x279/0x2ee
             kernel_init_freeable+0xf7/0x288
             kernel_init+0xf/0x180
             ret_from_fork+0x24/0x30
      
      -> #0 (&(&pool->lock)->rlock){-.-.}:
             __lock_acquire+0x101f/0x12a0
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             __queue_work+0xb2/0x520
             queue_work_on+0x38/0x80
             free_percpu+0x221/0x260
             pcpu_freelist_destroy+0x11/0x20
             stack_map_free+0x2a/0x40
             bpf_map_free_deferred+0x3c/0x50
             process_one_work+0x1f7/0x580
             worker_thread+0x54/0x410
             kthread+0x10f/0x150
             ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcpu_lock);
                                     lock(&htab->buckets[i].lock);
                                     lock(pcpu_lock);
        lock(&(&pool->lock)->rlock);
      
       *** DEADLOCK ***
      
      3 locks held by kworker/23:255/18872:
       #0: 00000000b36a6e16 ((wq_completion)events){+.+.},
           at: process_one_work+0x17a/0x580
       #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
           at: process_one_work+0x17a/0x580
       #2: 00000000e3e7a6aa (pcpu_lock){..-.},
           at: free_percpu+0x36/0x260
      
      stack backtrace:
      CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
      Hardware name: ...
      Workqueue: events bpf_map_free_deferred
      Call Trace:
       dump_stack+0x67/0x95
       print_circular_bug.isra.38+0x1c6/0x220
       check_prev_add.constprop.50+0x9f6/0xd20
       __lock_acquire+0x101f/0x12a0
       lock_acquire+0x9e/0x180
       _raw_spin_lock+0x2f/0x40
       __queue_work+0xb2/0x520
       queue_work_on+0x38/0x80
       free_percpu+0x221/0x260
       pcpu_freelist_destroy+0x11/0x20
       stack_map_free+0x2a/0x40
       bpf_map_free_deferred+0x3c/0x50
       process_one_work+0x1f7/0x580
       worker_thread+0x54/0x410
       kthread+0x10f/0x150
       ret_from_fork+0x24/0x30
      Signed-off-by: NJohn Sperbeck <jsperbeck@google.com>
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5329dcaf
    • Q
      mm/slab.c: fix an infinite loop in leaks_show() · 515d18ce
      Qian Cai 提交于
      [ Upstream commit 745e10146c31b1c6ed3326286704ae251b17f663 ]
      
      "cat /proc/slab_allocators" could hang forever on SMP machines with
      kmemleak or object debugging enabled due to other CPUs running do_drain()
      will keep making kmemleak_object or debug_objects_cache dirty and unable
      to escape the first loop in leaks_show(),
      
      do {
      	set_store_user_clean(cachep);
      	drain_cpu_caches(cachep);
      	...
      
      } while (!is_store_user_clean(cachep));
      
      For example,
      
      do_drain
        slabs_destroy
          slab_destroy
            kmem_cache_free
              __cache_free
                ___cache_free
                  kmemleak_free_recursive
                    delete_object_full
                      __delete_object
                        put_object
                          free_object_rcu
                            kmem_cache_free
                              cache_free_debugcheck --> dirty kmemleak_object
      
      One approach is to check cachep->name and skip both kmemleak_object and
      debug_objects_cache in leaks_show().  The other is to set store_user_clean
      after drain_cpu_caches() which leaves a small window between
      drain_cpu_caches() and set_store_user_clean() where per-CPU caches could
      be dirty again lead to slightly wrong information has been stored but
      could also speed up things significantly which sounds like a good
      compromise.  For example,
      
       # cat /proc/slab_allocators
       0m42.778s # 1st approach
       0m0.737s  # 2nd approach
      
      [akpm@linux-foundation.org: tweak comment]
      Link: http://lkml.kernel.org/r/20190411032635.10325-1-cai@lca.pw
      Fixes: d31676df ("mm/slab: alternative implementation for DEBUG_SLAB_LEAK")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      515d18ce
    • Y
      mm/cma_debug.c: fix the break condition in cma_maxchunk_get() · 13e1ea08
      Yue Hu 提交于
      [ Upstream commit f0fd50504a54f5548eb666dc16ddf8394e44e4b7 ]
      
      If not find zero bit in find_next_zero_bit(), it will return the size
      parameter passed in, so the start bit should be compared with bitmap_maxno
      rather than cma->count.  Although getting maxchunk is working fine due to
      zero value of order_per_bit currently, the operation will be stuck if
      order_per_bit is set as non-zero.
      
      Link: http://lkml.kernel.org/r/20190319092734.276-1-zbestahu@gmail.comSigned-off-by: NYue Hu <huyue2@yulong.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Safonov <d.safonov@partner.samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      13e1ea08
    • A
      mm: page_mkclean vs MADV_DONTNEED race · 38c5fce7
      Aneesh Kumar K.V 提交于
      [ Upstream commit 024eee0e83f0df52317be607ca521e0fc572aa07 ]
      
      MADV_DONTNEED is handled with mmap_sem taken in read mode.  We call
      page_mkclean without holding mmap_sem.
      
      MADV_DONTNEED implies that pages in the region are unmapped and subsequent
      access to the pages in that range is handled as a new page fault.  This
      implies that if we don't have parallel access to the region when
      MADV_DONTNEED is run we expect those range to be unallocated.
      
      w.r.t page_mkclean() we need to make sure that we don't break the
      MADV_DONTNEED semantics.  MADV_DONTNEED check for pmd_none without holding
      pmd_lock.  This implies we skip the pmd if we temporarily mark pmd none.
      Avoid doing that while marking the page clean.
      
      Keep the sequence same for dax too even though we don't support
      MADV_DONTNEED for dax mapping
      
      The bug was noticed by code review and I didn't observe any failures w.r.t
      test run.  This is similar to
      
      commit 58ceeb6b
      Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Date:   Thu Apr 13 14:56:26 2017 -0700
      
          thp: fix MADV_DONTNEED vs. MADV_FREE race
      
      commit ced10803
      Author: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Date:   Thu Apr 13 14:56:20 2017 -0700
      
          thp: fix MADV_DONTNEED vs. numa balancing race
      
      Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc:"Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      38c5fce7
    • Y
      mm/cma.c: fix the bitmap status to show failed allocation reason · 77a01e33
      Yue Hu 提交于
      [ Upstream commit 2b59e01a3aa665f751d1410b99fae9336bd424e1 ]
      
      Currently one bit in cma bitmap represents number of pages rather than
      one page, cma->count means cma size in pages. So to find available pages
      via find_next_zero_bit()/find_next_bit() we should use cma size not in
      pages but in bits although current free pages number is correct due to
      zero value of order_per_bit. Once order_per_bit is changed the bitmap
      status will be incorrect.
      
      The size input in cma_debug_show_areas() is not correct.  It will
      affect the available pages at some position to debug the failure issue.
      
      This is an example with order_per_bit = 1
      
      Before this change:
      [    4.120060] cma: number of available pages: 1@93+4@108+7@121+7@137+7@153+7@169+7@185+7@201+3@213+3@221+3@229+3@237+3@245+3@253+3@261+3@269+3@277+3@285+3@293+3@301+3@309+3@317+3@325+19@333+15@369+512@512=> 638 free of 1024 total pages
      
      After this change:
      [    4.143234] cma: number of available pages: 2@93+8@108+14@121+14@137+14@153+14@169+14@185+14@201+6@213+6@221+6@229+6@237+6@245+6@253+6@261+6@269+6@277+6@285+6@293+6@301+6@309+6@317+6@325+38@333+30@369=> 252 free of 1024 total pages
      
      Obviously the bitmap status before is incorrect.
      
      Link: http://lkml.kernel.org/r/20190320060829.9144-1-zbestahu@gmail.comSigned-off-by: NYue Hu <huyue2@yulong.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      77a01e33
    • Y
      mm/cma.c: fix crash on CMA allocation if bitmap allocation fails · e5f8857e
      Yue Hu 提交于
      [ Upstream commit 1df3a339074e31db95c4790ea9236874b13ccd87 ]
      
      f022d8cb ("mm: cma: Don't crash on allocation if CMA area can't be
      activated") fixes the crash issue when activation fails via setting
      cma->count as 0, same logic exists if bitmap allocation fails.
      
      Link: http://lkml.kernel.org/r/20190325081309.6004-1-zbestahu@gmail.comSigned-off-by: NYue Hu <huyue2@yulong.com>
      Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e5f8857e
    • L
      mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 5094a85d
      Linxu Fang 提交于
      [ Upstream commit 299c83dce9ea3a79bb4b5511d2cb996b6b8e5111 ]
      
      342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.
      
      e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
      node"), but the current code still has problems,
      
      When we have a node with only zone_movable and the node id is not zero,
      the size of node spanned pages is double added.
      
      That's because we have an empty normal zone, and zone_start_pfn or
      zone_end_pfn is not between arch_zone_lowest_possible_pfn and
      arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
      range just like the commit <96e907d1> (bootmem: Reimplement
      __absent_pages_in_range() using for_each_mem_pfn_range()).
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that the current memory of node 1 is double added.
      After this patch, the problem is fixed.
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
      Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.comSigned-off-by: NLinxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5094a85d
    • M
      hugetlbfs: on restore reserve error path retain subpool reservation · ffaafd27
      Mike Kravetz 提交于
      [ Upstream commit 0919e1b69ab459e06df45d3ba6658d281962db80 ]
      
      When a huge page is allocated, PagePrivate() is set if the allocation
      consumed a reservation.  When freeing a huge page, PagePrivate is checked.
      If set, it indicates the reservation should be restored.  PagePrivate
      being set at free huge page time mostly happens on error paths.
      
      When huge page reservations are created, a check is made to determine if
      the mapping is associated with an explicitly mounted filesystem.  If so,
      pages are also reserved within the filesystem.  The default action when
      freeing a huge page is to decrement the usage count in any associated
      explicitly mounted filesystem.  However, if the reservation is to be
      restored the reservation/use count within the filesystem should not be
      decrementd.  Otherwise, a subsequent page allocation and free for the same
      mapping location will cause the file filesystem usage to go 'negative'.
      
      Filesystem                         Size  Used Avail Use% Mounted on
      nodev                              4.0G -4.0M  4.1G    - /opt/hugepool
      
      To fix, when freeing a huge page do not adjust filesystem usage if
      PagePrivate() is set to indicate the reservation should be restored.
      
      I did not cc stable as the problem has been around since reserves were
      added to hugetlbfs and nobody has noticed.
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-2-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ffaafd27
    • J
      mm/hmm: select mmu notifier when selecting HMM · 85e1a6c4
      Jérôme Glisse 提交于
      [ Upstream commit 734fb89968900b5c5f8edd5038bd4cdeab8c61d2 ]
      
      To avoid random config build issue, select mmu notifier when HMM is
      selected.  In any cases when HMM get selected it will be by users that
      will also wants the mmu notifier.
      
      Link: http://lkml.kernel.org/r/20190403193318.16478-2-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      85e1a6c4
  5. 09 6月, 2019 1 次提交
    • J
      memcg: make it work on sparse non-0-node systems · 8b057ad8
      Jiri Slaby 提交于
      commit 3e8589963773a5c23e2f1fe4bcad0e9a90b7f471 upstream.
      
      We have a single node system with node 0 disabled:
        Scanning NUMA topology in Northbridge 24
        Number of physical nodes 2
        Skipping disabled node 0
        Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
        NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
      
      This causes crashes in memcg when system boots:
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
      ...
        RIP: 0010:list_lru_add+0x94/0x170
      ...
        Call Trace:
         d_lru_add+0x44/0x50
         dput.part.34+0xfc/0x110
         __fput+0x108/0x230
         task_work_run+0x9f/0xc0
         exit_to_usermode_loop+0xf5/0x100
      
      It is reproducible as far as 4.12.  I did not try older kernels.  You have
      to have a new enough systemd, e.g.  241 (the reason is unknown -- was not
      investigated).  Cannot be reproduced with systemd 234.
      
      The system crashes because the size of lru array is never updated in
      memcg_update_all_list_lrus and the reads are past the zero-sized array,
      causing dereferences of random memory.
      
      The root cause are list_lru_memcg_aware checks in the list_lru code.  The
      test in list_lru_memcg_aware is broken: it assumes node 0 is always
      present, but it is not true on some systems as can be seen above.
      
      So fix this by avoiding checks on node 0.  Remember the memcg-awareness by
      a bool flag in struct list_lru.
      
      Link: http://lkml.kernel.org/r/20190522091940.3615-1-jslaby@suse.cz
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b057ad8
  6. 22 5月, 2019 4 次提交
  7. 17 5月, 2019 3 次提交
  8. 10 5月, 2019 1 次提交
  9. 08 5月, 2019 2 次提交
  10. 04 5月, 2019 1 次提交
  11. 02 5月, 2019 1 次提交
    • J
      mm: Fix warning in insert_pfn() · 423497a9
      Jan Kara 提交于
      commit f2c57d91b0d96aa13ccff4e3b178038f17b00658 upstream.
      
      In DAX mode a write pagefault can race with write(2) in the following
      way:
      
      CPU0                            CPU1
                                      write fault for mapped zero page (hole)
      dax_iomap_rw()
        iomap_apply()
          xfs_file_iomap_begin()
            - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - invalidates radix tree entries in given range
                                      dax_iomap_pte_fault()
                                        grab_mapping_entry()
                                          - no entry found, creates empty
                                        ...
                                        xfs_file_iomap_begin()
                                          - finds already allocated block
                                        ...
                                        vmf_insert_mixed_mkwrite()
                                          - WARNs and does nothing because there
                                            is still zero page mapped in PTE
              unmap_mapping_pages()
      
      This race results in WARN_ON from insert_pfn() and is occasionally
      triggered by fstest generic/344. Note that the race is otherwise
      harmless as before write(2) on CPU0 is finished, we will invalidate page
      tables properly and thus user of mmap will see modified data from
      write(2) from that point on. So just restrict the warning only to the
      case when the PFN in PTE is not zero page.
      
      Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      423497a9
  12. 27 4月, 2019 3 次提交
  13. 20 4月, 2019 1 次提交
    • R
      mm: hide incomplete nr_indirectly_reclaimable in /proc/zoneinfo · d49dea54
      Roman Gushchin 提交于
      [fixed differently upstream, this is a work-around to resolve it for 4.19.y]
      
      Yongqin reported that /proc/zoneinfo format is broken in 4.14
      due to commit 7aaf7727 ("mm: don't show nr_indirectly_reclaimable
      in /proc/vmstat")
      
      Node 0, zone      DMA
        per-node stats
            nr_inactive_anon 403
            nr_active_anon 89123
            nr_inactive_file 128887
            nr_active_file 47377
            nr_unevictable 2053
            nr_slab_reclaimable 7510
            nr_slab_unreclaimable 10775
            nr_isolated_anon 0
            nr_isolated_file 0
            <...>
            nr_vmscan_write 0
            nr_vmscan_immediate_reclaim 0
            nr_dirtied   6022
            nr_written   5985
                         74240
            ^^^^^^^^^^
        pages free     131656
      
      The problem is caused by the nr_indirectly_reclaimable counter,
      which is hidden from the /proc/vmstat, but not from the
      /proc/zoneinfo. Let's fix this inconsistency and hide the
      counter from /proc/zoneinfo exactly as from /proc/vmstat.
      
      BTW, in 4.19+ the counter has been renamed and exported by
      the commit b29940c1abd7 ("mm: rename and change semantics of
      nr_indirectly_reclaimable_bytes"), so there is no such a problem
      anymore.
      
      Cc: <stable@vger.kernel.org> # 4.14.x-4.18.x
      Fixes: 7aaf7727 ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat")
      Reported-by: NYongqin Liu <yongqin.liu@linaro.org>
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d49dea54
  14. 17 4月, 2019 2 次提交
    • G
      mm: writeback: use exact memcg dirty counts · 43f47331
      Greg Thelen 提交于
      commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.
      
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's more
      subtle.  It's the amount the space available for file lru that matters.
      If a container has memory.max-200MiB of non reclaimable memory, then it
      will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
        	output_fd = open(output, O_CREAT|O_RDWR);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.comSigned-off-by: NGreg Thelen <gthelen@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43f47331
    • A
      mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() · 9a62d691
      Aneesh Kumar K.V 提交于
      commit c6f3c5ee40c10bb65725047a220570f718507001 upstream.
      
      With some architectures like ppc64, set_pmd_at() cannot cope with a
      situation where there is already some (different) valid entry present.
      
      Use pmdp_set_access_flags() instead to modify the pfn which is built to
      deal with modifying existing PMD entries.
      
      This is similar to commit cae85cb8add3 ("mm/memory.c: fix modifying of
      page protection by insert_pfn()")
      
      We also do similar update w.r.t insert_pfn_pud eventhough ppc64 don't
      support pud pfn entries now.
      
      Without this patch we also see the below message in kernel log "BUG:
      non-zero pgtables_bytes on freeing mm:"
      
      Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: NChandan Rajendra <chandan@linux.ibm.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a62d691
  15. 06 4月, 2019 4 次提交
    • Q
      page_poison: play nicely with KASAN · a6c56bf6
      Qian Cai 提交于
      [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path:
      
        BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
        Read of size 8 at addr ffff88881f800000 by task swapper/0
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         __asan_report_load8_noabort+0x19/0x20
         memchr_inv+0x2ea/0x330
         kernel_poison_pages+0x103/0x3d5
         get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for allocation
      before it checks memchr_inv() but only found a stale poison pattern.
      
      Also, false positives in free path,
      
        BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
        Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
        CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         check_memory_region+0x22d/0x250
         memset+0x28/0x40
         kernel_poison_pages+0x29e/0x3d5
         __free_pages_ok+0x75f/0x13e0
      
      due to KASAN adds poisoned redzones around slab objects, but the page
      poisoning needs to poison the whole page.
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a6c56bf6
    • Q
      mm/slab.c: kmemleak no scan alien caches · f09c424c
      Qian Cai 提交于
      [ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]
      
      Kmemleak throws endless warnings during boot due to in
      __alloc_alien_cache(),
      
          alc = kmalloc_node(memsize, gfp, node);
          init_arraycache(&alc->ac, entries, batch);
          kmemleak_no_scan(ac);
      
      Kmemleak does not track the array cache (alc->ac) but the alien cache
      (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
      out of init_arraycache().
      
      There is another place that calls init_arraycache(), but
      alloc_kmem_cache_cpus() uses the percpu allocation where will never be
      considered as a leak.
      
        kmemleak: Found object by alias at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         lookup_object+0x84/0xac
         find_and_get_object+0x84/0xe4
         kmemleak_no_scan+0x74/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
        kmemleak: Object 0xffff8007b9aa7e00 (size 256):
        kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
        kmemleak:   min_count = 1
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             kmemleak_alloc+0x84/0xb8
             kmem_cache_alloc_node_trace+0x31c/0x3a0
             __kmalloc_node+0x58/0x78
             setup_kmem_cache_node+0x26c/0x35c
             __do_tune_cpucache+0x250/0x2d4
             do_tune_cpucache+0x4c/0xe4
             enable_cpucache+0xc8/0x110
             setup_cpu_cache+0x40/0x1b8
             __kmem_cache_create+0x240/0x358
             create_cache+0xc0/0x198
             kmem_cache_create_usercopy+0x158/0x20c
             kmem_cache_create+0x50/0x64
             fsnotify_init+0x58/0x6c
             do_one_initcall+0x194/0x388
             kernel_init_freeable+0x668/0x688
             kernel_init+0x18/0x124
        kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         kmemleak_no_scan+0x90/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
      Fixes: 1fe00d50 ("slab: factor out initialization of array cache")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f09c424c
    • U
      mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! · 8a0fc62e
      Uladzislau Rezki (Sony) 提交于
      [ Upstream commit afd07389d3f4933c7f7817a92fb5e053d59a3182 ]
      
      One of the vmalloc stress test case triggers the kernel BUG():
      
        <snip>
        [60.562151] ------------[ cut here ]------------
        [60.562154] kernel BUG at mm/vmalloc.c:512!
        [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
        [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
        [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390
        <snip>
      
      it can happen due to big align request resulting in overflowing of
      calculated address, i.e.  it becomes 0 after ALIGN()'s fixup.
      
      Fix it by checking if calculated address is within vstart/vend range.
      
      Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      8a0fc62e
    • V
      mm, mempolicy: fix uninit memory access · 67abbb9c
      Vlastimil Babka 提交于
      [ Upstream commit 2e25644e8da4ed3a27e7b8315aaae74660be72dc ]
      
      Syzbot with KMSAN reports (excerpt):
      
      ==================================================================
      BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
      BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
      CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x173/0x1d0 lib/dump_stack.c:113
        kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
        __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
        mpol_rebind_policy mm/mempolicy.c:353 [inline]
        mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
        update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
        update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
        update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
        cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728
      
      ...
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
        kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
        kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
        kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
        mpol_new mm/mempolicy.c:276 [inline]
        do_mbind mm/mempolicy.c:1180 [inline]
        kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
        __do_sys_mbind mm/mempolicy.c:1354 [inline]
      
      As it's difficult to report where exactly the uninit value resides in
      the mempolicy object, we have to guess a bit.  mm/mempolicy.c:353
      contains this part of mpol_rebind_policy():
      
              if (!mpol_store_user_nodemask(pol) &&
                  nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
      
      "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
      ever see being uninitialized after leaving mpol_new().  So I'll guess
      it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
      but still part of statement starting on line 353.
      
      For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
      reachable for a mempolicy where mpol_set_nodemask() is called in
      do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
      with empty set of nodes, i.e.  MPOL_LOCAL equivalent, with MPOL_F_LOCAL
      flag.  Let's exclude such policies from the nodes_equal() check.  Note
      the uninit access should be benign anyway, as rebinding this kind of
      policy is always a no-op.  Therefore no actual need for stable
      inclusion.
      
      Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
      Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      67abbb9c